LABOR MARKET INTELLIGENCE DATA

BY

greenwichHR


Product Structure Tree

Number of Files Received: 5

1. Master

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter Yes No Comments
Duplicate rows > 10%? 92.82% unique rows
Missing values > 50% for one or more columns? 1 column has missing values > 50%
Most recent update is more than 6 months old? Latest date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?
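The first three scorecard parameters can be checked mechanically. The sketch below is illustrative only and is not the tool used to produce this report. It assumes that "unique rows" means distinct rows divided by total rows, that missing values are stored as None, and that staleness is measured in whole calendar months against a reference date; the function name `scorecard` and the `as_of` parameter are hypothetical.

```python
from datetime import date


def scorecard(rows, latest, as_of):
    """Evaluate three scorecard parameters for a list of row dicts.

    rows   -- list of dicts sharing the same keys; None marks a missing value
    latest -- date of the most recent record
    as_of  -- reference date for the recency check (an assumption; the
              report does not state which reference date it uses)
    """
    total = len(rows)
    if total == 0:
        raise ValueError("at least one row is required")
    columns = list(rows[0].keys())
    # Share of distinct rows among all rows (assumed definition of "unique rows")
    unique_pct = 100.0 * len({tuple(r.items()) for r in rows}) / total
    # Columns in which more than half of the values are missing
    sparse = [c for c in columns
              if sum(r[c] is None for r in rows) / total > 0.5]
    # Age of the latest record in whole calendar months
    age_months = (as_of.year - latest.year) * 12 + (as_of.month - latest.month)
    return {
        "unique_pct": unique_pct,
        "duplicate_rows_over_10pct": (100.0 - unique_pct) > 10.0,
        "columns_over_50pct_missing": sparse,
        "stale": age_months > 6,
    }
```

The PII parameter is left out because it cannot be reduced to a simple threshold.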

1.1. Metadata Summary

1.1.1. Data Dimensions

Number of Columns: 21

Number of Rows: 9,948,150

Date Range: 25-Apr-2016 to 30-Mar-2019

1.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.1.3. Data Subset

Sample 6 rows of the dataset.

1.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
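The five completeness measures defined above can be computed in a single pass over the data. The sketch below is a minimal illustration: it assumes missing values are stored as None and treats a column as continuous only when all of its present values are floats, which is a simplification. The function name `completeness` and the sample values are made up.

```python
def completeness(rows):
    """Summarize completeness for a list of row dicts sharing the same keys."""
    columns = list(rows[0].keys())
    missing_by_col = {c: sum(r[c] is None for r in rows) for c in columns}
    discrete, continuous, all_missing = [], [], []
    for c in columns:
        present = [r[c] for r in rows if r[c] is not None]
        if not present:
            all_missing.append(c)      # every value is missing
        elif all(isinstance(v, float) for v in present):
            continuous.append(c)       # simplifying assumption: floats only
        else:
            discrete.append(c)
    return {
        "discrete_columns": discrete,
        "continuous_columns": continuous,
        "all_missing_columns": all_missing,
        "complete_rows": sum(all(v is not None for v in r.values()) for r in rows),
        "missing_observations": sum(missing_by_col.values()),
    }


# Made-up sample rows, not taken from the dataset
sample = [
    {"company": "Aramark", "salary": 52000.0, "notes": None},
    {"company": "Express", "salary": None, "notes": None},
    {"company": None, "salary": 31000.0, "notes": None},
]
summary = completeness(sample)
```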
1.1.5. Missing Data

The plot shows the percentage of missing values for each column, color-coded on a spectrum from green (0%) to red (100%).

1.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the corresponding frequencies are plotted on the Y-axis.
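The binning step can be sketched as follows. This is a minimal illustration with equal-width bins over a chosen range. The function name `histogram` and the sample values are hypothetical, and the report does not state which binning rule its plots use.

```python
def histogram(values, bins, low, high):
    """Count values into `bins` equal-width bins over the range [low, high].

    Values outside the range are ignored, which is also how a zoomed view
    of a long-tailed column could be produced.
    """
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin
            i = min(int((v - low) / width), bins - 1)
            counts[i] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up values; 90 falls outside the range and is dropped
edges, counts = histogram([10, 20, 25, 40, 74, 75, 90], bins=3, low=0, high=75)
```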

1.2.1.1. Column - salary

Column salary represents the estimated pay level

Note: Figure 2 is a zoomed version of Figure 1, restricted to values greater than 0 and less than 75,000.

1.2.2. Frequency Counts - Categorical Variable(s)
1.2.2.1. Column - company

There are 6,193 unique company names.

The table below shows the frequency counts of the top 10 and bottom 10 company names.

Top 10
Bottom 10
Company Name Frequency Company Name Frequency
Care.Com 670,514 3M Company 1
Aramark 152,257 3M Health Information Systems 1
Wells+Fargo 132,242 A f l a c 1
Marriott International Inc 111,047 Acadia Healhcare 1
Wells Fargo 110,970 ACADIA HEALTHCARE 1
Express 106,088 Adams Resources & Energy Inc 1
Southern 105,905 adidas AG 1
Care.com 87,558 Advanced Emission Control Solutions 1
TARGET 82,568 Aflac - Capital Region 1
Lockheed+Martin 79,254 Aflac Careers 1
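Top 10 and bottom 10 frequency tables like the one above can be produced with a simple counter. The sketch below is an assumption about the method, not the report's actual code. It counts exact strings, so variants such as "Care.Com" and "Care.com" stay separate, as they do in the table. Ties are broken alphabetically, ignoring case, which is a guess based on the ordering of the bottom 10. The function name `top_and_bottom` is hypothetical.

```python
from collections import Counter


def top_and_bottom(values, k=10):
    """Return (number of unique values, top k, bottom k) by frequency."""
    counts = Counter(values)
    # Most frequent first; ties broken alphabetically, ignoring case
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))[:k]
    # Least frequent first; ties broken alphabetically, ignoring case
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0].lower()))[:k]
    return len(counts), top, bottom


# Small made-up sample using names that appear in the table
names = ["Care.Com"] * 3 + ["Care.com"] * 2 + ["Aramark"] * 2 + ["adidas AG"]
unique, top, bottom = top_and_bottom(names, k=2)
```

The same function applies to the other categorical columns in this section.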
1.2.2.2. Column - location

There are 54,050 unique locations.

The table below shows the frequency counts of the top 10 and bottom 10 locations.

Top 10
Bottom 10
Location Frequency Location Frequency
New York NY 77,567 65th Infantry PR 00923 1
Atlanta GA 66,526 Abbeville LA 70511 1
Seattle WA 65,405 Abbeville MS 38601 1
Charlotte NC 58,432 Abbott TX 76621 1
Houston TX 57,550 Abbottstown PA 1
Chicago IL 51,836 Abbyville KS 67510 1
United States 50,547 Aberdeen IN 1
San Francisco CA 44,994 Abie NE 68001 1
Minneapolis MN 39,806 Abington CT 06230 1
Dallas TX 39,739 Abington NC 1
1.2.2.3. Column - city

There are 17,370 unique cities.

The table below shows the frequency counts of the top 10 and bottom 10 cities.

Top 10
Bottom 10
City Frequency City Frequency
New York 164,137 65th Infantry 1
Houston 108,095 Abbyville 1
Atlanta 103,103 Abie 1
Chicago 102,014 Academy 1
Charlotte 90,206 Adah 1
Seattle 89,611 Adair Village 1
San Francisco 78,337 Adairville 1
Dallas 74,010 Adams County 1
Austin 71,042 Addy 1
Phoenix 66,116 Adelphia 1
1.2.2.4. Column - state_long

There are 52 unique states.

The table below shows the frequency counts of the top 10 and bottom 10 states.

Top 10
Bottom 10
State Frequency State Frequency
California 1,066,787 Wyoming 9,978
Texas 725,249 Vermont 12,031
Florida 481,256 Puerto Rico 12,591
New York 423,525 Alaska 13,675
Illinois 346,562 Montana 16,416
Pennsylvania 326,239 South Dakota 16,886
North Carolina 307,983 North Dakota 19,467
Georgia 306,908 Maine 27,760
Virginia 292,574 Idaho 31,023
Ohio 276,539 Rhode Island 31,751
1.2.2.5. Column - county

There are 1,837 unique counties.

The table below shows the frequency counts of the top 10 and bottom 10 counties.

Top 10
Bottom 10
County Frequency County Frequency
Los Angeles 219,785 Benson 1
King 180,297 Billings 1
Santa Clara 176,393 Broomfield 1
New York 165,407 Camas 1
Cook 161,817 Dundy 1
Orange 149,905 Eureka 1
Maricopa 144,711 Faulk 1
Dallas 141,594 Haakon 1
Harris 130,443 Harding 1
Middlesex 127,478 Hayes 1
1.2.2.6. Column - region_state

There are 416 unique regions.

The table below shows the frequency counts of the top 10 and bottom 10 region_state values.

Top 10
Bottom 10
Region State Frequency Region State Frequency
New York-Northern New Jersey-Long Island NY-NJ-PA MSA 469,139 All other territories and foreign countries 7
Los Angeles-Long Beach-Santa Ana CA MSA 301,634 GU NONMETROPOLITAN AREA 166
Chicago-Naperville-Joliet IL-IN-WI MSA 293,304 MA NONMETROPOLITAN AREA 334
Dallas-Fort Worth-Arlington TX MSA 272,626 Danville IL MSA 737
Washington-Arlington-Alexandria DC-VA-MD-WV MSA 270,099 Madera CA MSA 784
San Francisco-Oakland-Fremont CA MSA 237,849 Palm Coast FL MSA 802
Atlanta-Sandy Springs-Marietta GA MSA 220,896 Pine Bluff AR MSA 860
Boston-Cambridge-Quincy MA-NH MSA 215,914 Bay City MI MSA 967
Philadelphia-Camden-Wilmington PA-NJ-DE-MD MSA 186,660 Hinesville-Fort Stewart GA MSA 1,058
San Jose-Sunnyvale-Santa Clara CA MSA 176,857 Lewiston ID-WA MSA 1,076
1.2.2.7. Column - company_ref

There are 3,192 unique company references.

The table below shows the frequency counts of the top 10 and bottom 10 company references.

Top 10
Bottom 10
Company Reference Frequency Company Reference Frequency
CARE.COM INC 758,072 ADAMS RESOURCES & ENERGY INC 1
WELLS FARGO & COMPANY 243,217 ADIDAS 1
ARAMARK 152,702 ADVANCED EMISSIONS SOLUTIONS INC 1
KINDRED HEALTHCARE INC 147,931 AGNC INVESTMENT CORP 1
TARGET CORPORATION 139,739 AIR PRODUCTS AND CHEMICALS INC 1
MACY’S INC 130,728 AIR TRANSPORT SERVICES GROUP INC 1
MARRIOTT 116,405 ALLIN CORPORATION 1
EXPRESS INC 106,123 AMERICAN REALTY INVESTORS INC 1
THE SOUTHERN COMPANY 105,905 AMERICAS UNITED BANK 1
THE HOME DEPOT INC 100,318 AMKOR TECHNOLOGY INC 1
1.2.2.8. Column - ticker

There are 3,153 unique tickers.

The table below shows the frequency counts of the top 10 and bottom 10 tickers.

Top 10
Bottom 10
Ticker Frequency Ticker Frequency
CRCM 758,072 ADDDF 1
WFC 243,217 ADES 1
MAR 174,368 AE 1
ARMK 152,702 AGNC 1
KND 147,931 AGTC 1
TGT 139,739 ALLN 1
M 130,728 AMGP 1
EXPR 106,123 AMKR 1
SO 105,905 AMPG 1
HD 100,318 APD 1

1.3. Time Series Analysis

Periodicity: Daily

The graph below shows the trend in job postings from April 2016 to March 2019.
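A daily series of posting counts can be built by counting records per calendar day and filling days without postings with zero. The sketch below is illustrative: the function name `daily_counts` is hypothetical, and the zero-filling of gaps is an assumption rather than something the report states.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(post_dates):
    """Return [(day, count)] for every day between the first and last date."""
    counts = Counter(post_dates)
    if not counts:
        return []
    day, last = min(counts), max(counts)
    series = []
    while day <= last:
        # Days with no postings are reported as zero (assumed behavior)
        series.append((day, counts.get(day, 0)))
        day += timedelta(days=1)
    return series


# Made-up posting dates
series = daily_counts([date(2016, 4, 25), date(2016, 4, 25), date(2016, 4, 27)])
```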

2. Role

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter Yes No Comments
Duplicate rows > 10%? 25% unique rows
Missing values > 50% for one or more columns? 0 columns have missing values > 50%
Most recent update is more than 6 months old? Latest date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?

2.1. Metadata Summary

2.1.1. Data Dimensions

Number of Columns: 3

Number of Rows: 6,539,168

Date Range: 04-Mar-2019 to 30-Mar-2019

2.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

2.1.3. Data Subset

Sample 6 rows of the dataset.

2.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

2.1.5. Missing Data

There are no missing records present in the data.

2.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous variables for which to plot histograms.

2.2.2. Frequency Counts - Categorical Variable(s)
2.2.2.1. Column - role

The table below shows the top 10 and bottom 10 job roles in the data.

There are about 1,667 different job roles in the data.

Top 10
Bottom 10
Role Frequency Role Frequency
Sales/Marketing (all) 593,243 Application Assistant 3
Manager 505,026 Apprentice Plumber 3
Engineer 379,686 Behavioral Health Tech 3
Sales 335,743 Care Team Member 3
Associate 263,894 Carpenter Helper 3
Team Member 142,387 Chief Analytics Officer 3
Driver 140,769 Cutting Technician 3
Assistant 129,017 Desktop Support Administrator 3
salesperson 123,307 Dietetic Technician 3
service tech/mechanic 121,162 Disaster Recovery Manager 3

2.3. Time Series Analysis

The time series shows the number of jobs posted on each day of the month covered by the data.

3. Tag

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter Yes No Comments
Duplicate rows > 10%? 33.33% unique rows
Missing values > 50% for one or more columns? 0 columns have missing values > 50%
Most recent update is more than 6 months old? Latest date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?

3.1. Metadata Summary

3.1.1. Data Dimensions

Number of Columns: 3

Number of Rows: 13,607,028

Date Range: 05-Mar-2019 to 30-Mar-2019

3.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

3.1.3. Data Subset

Sample 6 rows of the dataset.

3.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

3.1.5. Missing Data

There are no missing records present in the data.

3.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous variables for which to plot histograms.

3.2.2. Frequency Counts - Categorical Variable(s)
3.2.2.1. Column - tag

The table below shows the top 10 and bottom 10 job tags in the data.

There are about 1,846 different tags in the data.

Top 10
Bottom 10
Tag Frequency Tag Frequency
sales 459,128 CouchDB 2
Team 418,784 Dispatch Systems 2
Operations 324,657 Mother Baby 2
Management 295,872 Neuropsychiatry 2
Design 252,709 Organ Donation 2
Engineering 220,825 PeopleCode 2
Hiring 193,161 PeopleSoft HRMS 2
Customer Service 192,642 Pulse Ox 2
Training 161,920 Solving Equations 2
Lead 160,402 appexchange 3

3.3. Time Series Analysis

The time series shows the number of jobs posted on each day of the month covered by the data.

4. Timelog

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter Yes No Comments
Duplicate rows > 10%? 20% unique rows
Missing values > 50% for one or more columns? 0 columns have missing values > 50%
Most recent update is more than 6 months old? Latest date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?

4.1. Metadata Summary

4.1.1. Data Dimensions

Number of Columns: 5

Number of Rows: 148,139,575

Date Range: 25-Apr-2016 to 30-Mar-2019

4.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

4.1.3. Data Subset

Sample 6 rows of the dataset.

4.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

4.1.5. Missing Data

4.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous columns for which to plot histograms.

4.2.2. Frequency Counts - Categorical Variable(s)
4.2.2.1. Column - duration_days

The plot below shows the 10 most frequent values of the number of days for which a listing was open.

There are about 285 distinct values for the number of days a listing was open.

4.3. Time Series Analysis

4.3.1. Column - post_date

The graph below shows the number of posts listed from 2016 to 2019.

4.3.2. Column - remove_date

The graph below shows the number of posts removed from 2016 to 2019.

5. Title

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter Yes No Comments
Duplicate rows > 10%? 50% unique rows
Missing values > 50% for one or more columns? 0 columns have missing values > 50%
Most recent update is more than 6 months old? Latest date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?

5.1. Metadata Summary

5.1.1. Data Dimensions

Number of Columns: 3

Number of Rows: 1,294,582

Date Range: 04-Mar-2019 to 30-Mar-2019

5.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

5.1.3. Data Subset

Sample 6 rows of the dataset.

5.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

5.1.5. Missing Data

5.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

5.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous variables for which to plot histograms.

5.2.2. Frequency Counts - Categorical Variable(s)
5.2.2.1. Column - Title

There are about 298,762 different titles in the data.

The table below shows the top 10 and bottom 10 headline titles in the job listings.

Top 10
Bottom 10
Title Frequency Title Frequency
Assistant Manager 4,788 ‘Back Up’ Nanny Needed For 1 Child In Brooklyn 2
Sales Associate 4,772 • Customer Support Engineer - Lancope Stealthwatch 2
Server 4,212 • Experienced With Newborn/infants - CPR Certified - Active.. 2
Delivery Driver 3,768 • Personal/Office Assistant (1 Day A Week) 2
Store Manager 2,964 • Retail Support - Receiving Team Lead, Flex: Augusta Mall 2
Dishwasher 2,682 -Plant Shift Supervisor - 2nd Shift Memphis, TN 2
Cook 2,672 -Senior Software Development Manager 2
Assistant Store Manager 2,644 !! Restaurant positions open !! 2
General Manager 2,626 !!! CAREGIVER PART TIME DAY SHIFT 6AM-2PM Starting Wages $15.. 2
Retail Sales Associate 2,536 !!! FULL TIME CAREGIVER PM Shift 2-10pm !!! 2

5.3. Time Series Analysis

The time series shows the number of jobs posted on each day of the month covered by the data.

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
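Kernel smoothing can be illustrated with a Gaussian kernel: each observation contributes a small bell curve, and the density estimate is their average. The sketch below is a generic textbook formulation, not the plotting routine used for this report. The function name `kernel_density` and the fixed `bandwidth` argument are assumptions.

```python
import math


def kernel_density(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    # Normalizing constant so that the estimate integrates to 1
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered on each observation
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up sample; a larger bandwidth gives a smoother curve
f = kernel_density([1.0, 2.0, 2.5, 4.0], bandwidth=0.5)
```

Evaluating `f` on a grid of x values gives the points of the smoothed curve.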

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
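The two sets of quantiles can be computed as in the sketch below. It assumes a normal reference distribution fitted to the sample mean and standard deviation, and plotting positions of (i + 0.5) / n. Both are common conventions but are assumptions here, as is the function name `qq_points`.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ plot."""
    n = len(sample)
    ordered = sorted(sample)
    # Reference normal distribution fitted to the sample (assumed choice)
    ref = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1)
    return [(ref.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


# Made-up sample
points = qq_points([5, 1, 4, 2, 3])
```

If the sample is close to normal, the points lie near the line y = x; systematic curvature suggests skew or heavy tails.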

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
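A correlation matrix can be sketched with the Pearson coefficient computed for every pair of columns. This is a minimal illustration: the report does not state which correlation coefficient it uses, and the function names and sample columns are hypothetical. Constant columns are not handled and would cause a division by zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return a nested dict with the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up numeric columns
matrix = correlation_matrix({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
```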

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
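The aggregation behind monthly and yearly trend charts can be sketched as a count of records per period. The function name `trend` and its `period` parameter are hypothetical and serve only as an illustration.

```python
from collections import Counter
from datetime import date


def trend(dates, period="month"):
    """Count records per month or per year, sorted chronologically."""
    if period == "month":
        key = lambda d: (d.year, d.month)
    elif period == "year":
        key = lambda d: (d.year,)
    else:
        raise ValueError("period must be 'month' or 'year'")
    return sorted(Counter(key(d) for d in dates).items())


# Made-up dates
dates = [date(2016, 4, 25), date(2016, 4, 30), date(2017, 4, 1)]
monthly = trend(dates, "month")
yearly = trend(dates, "year")
```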

Appendix 2: Data Quality Scorecard

Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability - Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications, when and where they need it for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training, testing, and validating the models.

2. Completeness - All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.

3. Uniqueness - Data redundancy should be avoided

Uniqueness requires that no duplicate records are reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.

4. Recency - Data is recent and not outdated

Recency is the degree to which information is current for the period in question. It measures how up-to-date information is, and whether it is correct despite possible time-related changes.

5. Consistency - Uniform data types across the column

It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).
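Column-level consistency, in the narrow sense of uniform data types, can be checked by profiling the types present in a column. The sketch below covers only that aspect. It assumes missing values are stored as None and ignores them, and the function names `type_profile` and `is_consistent` are hypothetical.

```python
from collections import Counter


def type_profile(values):
    """Count the Python type names of all non-missing values in a column."""
    return Counter(type(v).__name__ for v in values if v is not None)


def is_consistent(values):
    """A column is treated as consistent if at most one type is present."""
    return len(type_profile(values)) <= 1
```

For example, a column holding both the integer 1 and the string "1" would be flagged as inconsistent, while a column of strings with some missing entries would pass.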