Workforce Data

Product Structure Tree

Number of Files Received: 4

1. Gender

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓	Cannot calculate unique rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓		Latest Date: 2018-Dec
Data contains PII (Personally Identifiable Information)?		✓

1.1. Metadata Summary

1.1.1. Data Dimensions

Number of Columns: 10

Number of Rows: 733,282,691

Date Range: Jan-2008 to Dec-2018

1.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.1.3. Data Subset

Sample 6 rows of the dataset.

1.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
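As a rough illustration, the five completeness measures above could be computed as follows. This is a minimal sketch, not the tool used to produce this report: it assumes records are held as Python dicts with None marking a missing value, and the function name completeness_summary and the sample rows are hypothetical. Treating a column as continuous whenever all of its values are numeric is also an assumption; a numeric code such as seniority would need to be marked as discrete explicitly.

```python
from numbers import Number


def completeness_summary(rows, columns):
    """Summarize completeness for dict records; None marks a missing value."""
    discrete, continuous, all_missing = [], [], []
    for col in columns:
        present = [r.get(col) for r in rows if r.get(col) is not None]
        if not present:
            all_missing.append(col)  # every value in the column is missing
        elif all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
            continuous.append(col)   # assumption: all-numeric means continuous
        else:
            discrete.append(col)
    complete_rows = sum(all(r.get(c) is not None for c in columns) for r in rows)
    missing_obs = sum(r.get(c) is None for r in rows for c in columns)
    return {
        "discrete_columns": discrete,
        "continuous_columns": continuous,
        "all_missing_columns": all_missing,
        "complete_rows": complete_rows,
        "missing_observations": missing_obs,
    }


# Hypothetical sample records, for illustration only.
rows = [
    {"company": "A", "count": 12.5, "note": None},
    {"company": "B", "count": None, "note": None},
    {"company": None, "count": 3.0, "note": None},
]
summary = completeness_summary(rows, ["company", "count", "note"])
```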

1.1.5. Missing Data

The plot shows the percentage of missing values for each column, color-coded on a spectrum from green (0%) to red (100%).
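The values behind such a plot could be derived as in the sketch below. It assumes dict records with None for missing values and a simple linear green-to-red interpolation; the helper names and the exact color scale are illustrative assumptions, not the scale used in the report's plot.

```python
def missing_percentages(rows, columns):
    """Percentage of missing (None) values per column."""
    total = len(rows)
    if total == 0:
        return {c: 0.0 for c in columns}
    return {c: 100.0 * sum(r.get(c) is None for r in rows) / total for c in columns}


def green_to_red(pct):
    """Map 0..100 percent missing onto a green-to-red RGB hex color."""
    frac = min(max(pct, 0.0), 100.0) / 100.0
    red = round(255 * frac)
    green = round(255 * (1 - frac))
    return f"#{red:02x}{green:02x}00"


# Hypothetical sample records, for illustration only.
rows = [
    {"a": 1, "b": "x"},
    {"a": None, "b": "y"},
    {"a": 3, "b": "z"},
    {"a": 4, "b": "w"},
]
pct = missing_percentages(rows, ["a", "b"])
colors = {col: green_to_red(p) for col, p in pct.items()}
```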

1.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the corresponding frequency of each bin is plotted on the Y-axis.
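The binning step can be sketched as follows. This is a minimal equal-width binning example under stated assumptions: the function name histogram, the number of bins, and the sample values are hypothetical, and the report does not say which binning rule its figures use.

```python
def histogram(values, bins, lo=None, hi=None):
    """Equal-width binning: returns (bin edges, frequency per bin)."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    kept = [v for v in values if lo <= v <= hi]
    if hi == lo:
        return [lo, hi], [len(kept)]  # degenerate range: a single bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in kept:
        # A value equal to the upper edge is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical sample values, for illustration only.
edges, counts = histogram([1, 2, 2, 3, 10], bins=3)
```

A zoomed view like the second figure in each subsection could be approximated by filtering before binning, for example `[v for v in values if 60 < v < 500]`.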

1.2.1.1 Column - count

Column count represents the expected number of employees in the position at a particular date

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

1.2.1.2 Column - inflow

Column inflow represents the expected number of employees moving into positions specified at that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

1.2.1.3 Column - outflow

Column outflow represents the expected number of employees moving out of positions specified at that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

1.2.2. Frequency Counts - Categorical Variable(s)

1.2.2.1. Column - company

Column company represents the name of the company

There are 4,716 unique company names

The table below shows the frequency counts of the top 10 and bottom 10 company names

Top 10		Bottom 10
Company	Frequency	Company	Frequency
Citigroup Inc	1,773,765	Shanghai Chlor-Alkali Chemical Co Ltd	257
General Electric Company	1,712,321	Max Financial Services Ltd	274
International Business Machines Corporation	1,577,087	Sino Biopharmaceutical Ltd	504
Alphabet Inc	1,455,052	Enochian Biosciences Inc	510
Oracle Corporation	1,408,051	Wrap Technologies Inc	796
Hewlett Packard Enterprise Company	1,407,815	Biospecifics Technologies Corp	1,024
Royal Dutch Shell PLC	1,401,654	City Office REIT Inc	1,043
HSBC Holdings PLC	1,397,208	Marker Therapeutics Inc	1,047
Facebook Inc	1,382,828	Riot Blockchain Inc	1,053
Dell Technologies Inc	1,355,712	Daito Trust Construction Co Ltd	1,092
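A frequency table of this kind could be produced as sketched below. The function name top_and_bottom and the sample values are hypothetical; the sketch assumes the column fits in memory as a list, which would not hold for a dataset of this size without chunking or a database query.

```python
from collections import Counter


def top_and_bottom(values, n=10):
    """Return (number of unique values, top-n counts, bottom-n counts)."""
    counts = Counter(values)
    ranked = counts.most_common()                      # descending frequency
    top = ranked[:n]
    bottom = sorted(ranked, key=lambda kv: kv[1])[:n]  # ascending frequency
    return len(counts), top, bottom


# Hypothetical sample column, for illustration only.
values = ["A"] * 3 + ["B"] * 2 + ["C"]
unique, top, bottom = top_and_bottom(values, n=2)
```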

1.2.2.2. Column - region

Column region represents the region name

1.2.2.3. Column - seniority

Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4

1.2.2.4. Column - role_k50

Column role_k50 represents the job role when there are 50 unique job roles

There are 51 unique role_k50 values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K50	Frequency	Role K50	Frequency
senior_manager	88,624,734	barista	415,143
manager	47,264,698	pharmacist	1,096,909
technician	46,650,648	realtor	1,175,383
account_manager	39,564,016	software_engineer_\|_internet	1,231,679
project_manager	34,188,909	retail_sales_consultant	1,233,677
product_manager	33,749,692	cashier	1,308,013
director	31,680,632	clinical_research_associate	1,389,100
financial_analyst	30,762,309	financial_advisor	1,555,267
analyst_\|_information_technology_services	25,193,425	geologist	1,557,940
consultant	23,436,005	pilot	2,248,606

1.2.2.5. Column - role_k150

Column role_k150 represents the job role when there are 150 unique job roles

There are 151 unique role_k150 values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K150	Frequency	Role K150	Frequency
manager	21,779,686	package_handler	183,271
sales	17,903,789	cashier_\|_restaurants	203,362
program_manager	17,687,710	tax_preparer	284,319
marketing_manager	15,490,913	specialist_\|_consumer_electronics	309,011
team_leader	14,526,238	avon_representative	323,740
hr_manager	13,477,507	rf_engineer	330,006
coordinator	12,807,910	area_manager_\|_internet	370,923
customer_service_representative	12,516,445	barista	415,143
senior_manager	12,399,990	assistant_director_admissions	437,276
gerente	12,232,697	pharmacy_technician	449,204

1.2.2.6. Column - gender

Column gender represents the estimated gender based on the first name of employees

1.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from January 2008 to December 2018
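A monthly series for such a graph could be built as in the sketch below. It assumes the trend is the sum of the count column per calendar month, which the report does not state explicitly; the function name monthly_totals and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(records):
    """Sum the count per (year, month); records are (date, count) pairs."""
    totals = defaultdict(float)
    for day, count in records:
        totals[(day.year, day.month)] += count
    return dict(sorted(totals.items()))  # chronological order


# Hypothetical sample records, for illustration only.
records = [
    (date(2008, 2, 1), 7.5),
    (date(2008, 1, 1), 10.0),
    (date(2008, 1, 1), 5.0),
]
series = monthly_totals(records)
```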

2. Long

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓		Latest Date: 2018-Dec
Data contains PII (Personally Identifiable Information)?		✓

2.1. Metadata Summary

2.1.1. Data Dimensions

Number of Columns: 11

Number of Rows: 366,618,005

Date Range: Dec-2008 to Dec-2018

2.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

2.1.3. Data Subset

Sample 6 rows of the dataset.

2.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

2.1.5. Missing Data

There are no missing records present in the data.

2.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)

2.2.1.1 Column - prestige

Column prestige represents the average prestige for employees in the position

2.2.1.2 Column - salary

Column salary represents the expected total salary for employees in the position represented

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 1000).

2.2.1.3 Column - count

Column count represents the expected number of employees at a position

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

2.2.1.5 Column - inflow

Column inflow represents the expected number of employees moving into positions specified at that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

2.2.1.6 Column - outflow

Column outflow represents the expected number of employees moving out of positions specified at that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

2.2.2. Frequency Counts - Categorical Variable(s)

2.2.2.1. Column - company

Column company represents the name of the company

There are 4,716 unique company names

The table below shows the frequency counts of the top 10 and bottom 10 company names

Top 10		Bottom 10
Company	Frequency	Company	Frequency
Facebook Inc	1,125,126	Shanghai Chlor-Alkali Chemical Co Ltd	104
MICROSOFT CORPORATION	1,018,422	Max Financial Services Ltd	132
General Electric Company	896,695	Sino Biopharmaceutical Ltd	217
Caterpillar Inc	873,300	Enochian Biosciences Inc	260
GlaxoSmithKline PLC	817,960	Wrap Technologies Inc	274
International Business Machines Corporation	813,835	Riot Blockchain Inc	455
Apple, Inc	806,387	City Office REIT Inc	456
Royal Dutch Shell PLC	801,651	Biospecifics Technologies Corp	528
Exxon Mobil Corporation	798,659	Daito Trust Construction Co Ltd	528
Citigroup Inc	794,934	Marker Therapeutics Inc	611

2.2.2.2. Column - region

Column region represents the region name

2.2.2.3. Column - seniority

Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4

2.2.2.4. Column - role_k50

Column role_k50 represents the job role when there are 50 unique job roles

There are 51 unique role_k50 values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K50	Frequency	Role K50	Frequency
senior_manager	44,283,472	barista	209,862
manager	23,653,844	pharmacist	558,516
technician	23,350,643	realtor	588,992
account_manager	19,781,922	retail_sales_consultant	616,437
project_manager	17,077,549	software_engineer_\|_internet	618,526
product_manager	16,851,306	cashier	656,132
director	15,856,308	clinical_research_associate	701,160
financial_analyst	15,389,108	financial_advisor	773,648
analyst_\|_information_technology_services	12,557,566	geologist	792,151
consultant	11,746,686	pilot	1,129,199

2.2.2.5. Column - role_k150

Column role_k150 represents the job role when there are 150 unique job roles

There are 151 unique role_k150 values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K150	Frequency	Role K150	Frequency
manager	10,879,914	package_handler	92,160
sales	8,941,708	cashier_\|_restaurants	103,492
program_manager	8,827,383	tax_preparer	141,885
marketing_manager	7,738,659	rf_engineer	158,132
team_leader	7,248,967	specialist_\|_consumer_electronics	158,328
hr_manager	6,727,866	avon_representative	160,467
coordinator	6,402,989	area_manager_\|_internet	186,436
customer_service_representative	6,254,133	barista	209,862
senior_manager	6,209,316	assistant_director_admissions	218,827
gerente	6,115,574	pharmacy_technician	228,888

2.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from January 2008 to December 2018

3. Postings

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓	100% unique rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓		Latest Date: Dec-2018
Data contains PII (Personally Identifiable Information)?		✓

3.1. Metadata Summary

3.1.1. Data Dimensions

Number of Columns: 12

Number of Rows: 913,655

Date Range: April-2016 to December-2018

3.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

3.1.3. Data Subset

Sample 6 rows of the dataset.

3.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

3.1.5. Missing Data

3.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)

3.2.1.1 Column - new_postings

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 300).

3.2.1.2 Column - active_postings

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 300).

3.2.1.3 Column - removed_postings

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 150).

3.2.1.4 Column - new_salary_avg

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 0 and less than 200000).

3.2.1.5 Column - active_salary_avg

3.2.1.6 Column - removed_salary_avg

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 10000 and less than 1000).

3.2.2. Frequency Counts - Categorical Variable(s)

3.2.2.1. Column - company

Column company represents the name of the company

There are 1,026 unique company names

The table below shows the frequency counts of the top 10 and bottom 10 company names

Top 10		Bottom 10
Company	Frequency	Company	Frequency
Amedisys Inc	35,635	5N Plus Inc	1
J.Crew Group, Inc	27,801	Acceleron Pharma Inc	1
AECOM	27,628	Axon Enterprise Inc	1
Anthem, Inc	25,168	Bactolac Pharmaceutical Inc	1
Ascena Retail Group, Inc	24,421	Bassett Furniture Industries Inc	1
Lockheed Martin Corporation	17,047	Carbon Black Inc	1
Tractor Supply Company	16,570	Cascade Microtech Inc	1
United Rentals Inc	14,667	Chegg Inc	1
Hot Topic Inc	14,538	CIT Group Inc	1
Boston Scientific Corporation	14,212	Crane Company	1

3.2.2.2. Column - region_state

Column region_state represents the region name

The table below shows the Top 10 and Bottom 10 regions

Top 10		Bottom 10
Region State	Frequency	Region State	Frequency
New York-Northern New Jersey-Long Island NY-NJ-PA MSA	34,527	MA NONMETROPOLITAN AREA	7
Boston-Cambridge-Quincy MA-NH MSA	28,657	Cumberland MD-WV MSA	22
Los Angeles-Long Beach-Santa Ana CA MSA	27,768	Danville IL MSA	29
Chicago-Naperville-Joliet IL-IN-WI MSA	24,628	Monroe MI MSA	39
San Francisco-Oakland-Fremont CA MSA	23,636	Lake Havasu City-Kingman AZ MSA	52
Dallas-Fort Worth-Arlington TX MSA	22,265	Weirton-Steubenville WV-OH MSA	53
Washington-Arlington-Alexandria DC-VA-MD-WV MSA	19,037	Bay City MI MSA	54
Philadelphia-Camden-Wilmington PA-NJ-DE-MD MSA	18,565	Yuba City CA MSA	55
Atlanta-Sandy Springs-Marietta GA MSA	18,131	Danville VA MSA	56
San Jose-Sunnyvale-Santa Clara CA MSA	17,006	Sandusky OH MSA	63

3.2.2.3. Column - seniority

Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4

3.2.2.4. Column - role_k50

Column role_k50 represents the job role when there are 50 unique job roles

There are 51 unique role_k50 values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K50	Frequency	Role K50	Frequency
technician	82,522	software_engineer_\|_internet	110
manager	58,808	geologist	233
senior_manager	53,484	financial_advisor	530
account_manager	47,402	barista	710
rn	39,992	realtor	725
project_manager	35,365	tax_associate	1,033
customer_service_representative	30,510	pilot	1,169
product_manager	26,351	retail_sales_consultant	1,248
sales_associate	25,583	clinical_research_associate	1,552
financial_analyst	24,869	cashier	2,121

3.2.2.5. Column - role_k150

Column role_k150 represents the job role when there are 150 unique job roles

There are 151 unique role_k150 values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K150	Frequency	Role K150	Frequency
sales	35,598	area_manager_\|_internet	4
customer_service_representative	30,398	specialist_\|_consumer_electronics	8
store_manager	24,382	avon_representative	9
sales_associate	21,430	cashier_\|_restaurants	25
operator	21,181	independent_distributor	35
program_manager	19,578	account_manager_\|_transportation_trucking_railroad	77
technician	18,133	sales_associate_\|_retail	90
concierge	17,901	terminal_manager	99
assistant_manager	16,968	rf_engineer	110
construction_manager	16,036	software_engineer_\|_internet	110

3.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from April 2016 to December 2018

4. Transitions

Data Quality Scorecard

Parameter definitions are available in Appendix 2

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓	100% unique rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓		Latest Date: Dec-2018
Data contains PII (Personally Identifiable Information)?		✓

4.1. Metadata Summary

4.1.1. Data Dimensions

Number of Columns: 8

Number of Rows: 24,004

Date Range: December-2007 to December-2018

4.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

4.1.3. Data Subset

Sample 6 rows of the dataset.

4.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

4.1.5. Missing Data

4.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)

4.2.1.1 Column - outflow

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 10).

4.2.2. Frequency Counts - Categorical Variable(s)

4.2.2.1. Column - company_a

Column company_a represents the name of the company

Company	Frequency
MICROSOFT CORPORATION	7,137
International Business Machines Corporation	6,347
Oracle Corporation	4,828
Amazon.com, Inc	3,225
Alphabet Inc	2,467

4.2.2.2. Column - company_b

Column company_b represents the name of the company

Company	Frequency
MICROSOFT CORPORATION	6,612
Amazon.com, Inc	5,314
Alphabet Inc	4,482
Oracle Corporation	4,225
International Business Machines Corporation	3,371

4.2.2.3. Column - region_a

Column region_a represents the region name

The graph below shows the top 10 regions

4.2.2.4. Column - region_b

Column region_b represents the region name

The graph below shows the top 10 regions

4.2.2.5. Column - role_k50_a

Column role_k50_a represents the job role when there are 50 unique job roles

There are 48 unique role_k50_a values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K50	Frequency	Role K50	Frequency
software_engineer	3,740	clinical_research_associate	1
account_manager_\|_information_technology_services	2,969	financial_advisor	1
analyst_\|_information_technology_services	2,221	pilot	1
account_manager	1,703	realtor	1
product_owner	1,649	cashier	2
senior_manager	1,539	server	2
project_manager_\|_information_technology_services	1,344	manager_\|_restaurants	3
product_manager	1,133	tax_associate	3
software_engineer_\|_internet	1,028	mortgage_loan_officer	5
network_engineer	860	agent	6

4.2.2.6. Column - role_k50_b

Column role_k50_b represents the job role when there are 50 unique job roles

There are 48 unique role_k50_b values

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10		Bottom 10
Role K50	Frequency	Role K50	Frequency
software_engineer	3,062	cashier	1
account_manager_\|_information_technology_services	3,060	clinical_research_associate	1
software_engineer_\|_internet	2,231	geologist	1
product_owner	1,940	manager_\|_restaurants	2
account_manager	1,939	pharmacist	2
analyst_\|_information_technology_services	1,716	pilot	2
project_manager_\|_information_technology_services	1,286	mortgage_loan_officer	4
senior_manager	1,269	tax_associate	5
product_manager	1,160	realtor	6
empty	1,002	retail_sales_consultant	6

4.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from December 2007 to December 2018

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
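The kernel smoothing described above can be sketched with a Gaussian kernel. This is an illustrative sketch only: the choice of a Gaussian kernel, the fixed bandwidth of 0.5, the function name gaussian_kde, and the sample values are all assumptions, and the report does not say which kernel or bandwidth its plots use.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a density function: the average of Gaussian bumps on each point."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


# Hypothetical sample values and bandwidth, for illustration only.
density = gaussian_kde([1.0, 2.0, 2.5, 3.0, 7.0], bandwidth=0.5)

# Numerically integrate over a range wide enough to cover the tails.
step = 0.01
area = sum(density(-5 + i * step) * step for i in range(2000))
```

A larger bandwidth gives a smoother curve at the cost of blurring nearby peaks together.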

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
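The pairs of quantiles behind a normal QQ plot could be computed as in this sketch. It assumes the reference distribution is a normal distribution fitted with the sample mean and standard deviation, and it uses plotting positions (i - 0.5) / n, which is one common convention among several; the function name qq_points and the sample are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted sample value with its theoretical normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical sample values, for illustration only.
points = qq_points([2.0, 4.0, 6.0, 8.0, 10.0])
```

Points falling close to the line y = x suggest the sample is approximately normal; systematic curvature at the ends suggests skew or heavy tails.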

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
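A correlation matrix of this kind can be sketched as follows. The sketch assumes the Pearson coefficient is meant (the text does not name one) and that columns are equal-length lists of numbers; the function names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # Note: a constant column gives sx or sy of zero and is not handled here.
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical sample columns, for illustration only.
matrix = correlation_matrix({
    "count": [1, 2, 3, 4],
    "inflow": [2, 4, 6, 8],
    "outflow": [4, 3, 2, 1],
})
```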

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.

Appendix 2: Data Quality Scorecard

Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability - Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications, when and where they need it, for further analysis. This is particularly important because many machine learning algorithms require enough data samples to train, test, and validate models.

2. Completeness - All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values; it is also called comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.

3. Uniqueness - Data redundancy should be avoided

Uniqueness requires that no duplicate records are reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.

4. Recency - Data is recent and not outdated

Recency is the degree to which the information is current. It measures how up-to-date the information is, and whether it is still correct despite possible time-related changes.

5. Consistency - Uniform data types across the column

Consistency refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column-level consistency).
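Several of these dimensions map directly onto the scorecard thresholds used in this report (duplicate rows > 10%, missing values > 50%, latest update more than 6 months old). The sketch below shows one way such checks could be computed. It is an illustration under stated assumptions, not the framework's implementation: records are dicts with None for missing values, age is counted in whole calendar months relative to a hypothetical as_of date, consistency is approximated by checking for mixed Python types within a column, and the availability and PII checks are not covered.

```python
from datetime import date


def scorecard(rows, columns, date_column, as_of):
    """Evaluate uniqueness, completeness, recency, and type consistency."""
    total = len(rows)
    unique = len({tuple(r.get(c) for c in columns) for r in rows})
    duplicate_pct = 100.0 * (total - unique) / total
    sparse = [c for c in columns
              if sum(r.get(c) is None for r in rows) / total > 0.5]
    latest = max(r[date_column] for r in rows if r.get(date_column) is not None)
    # Assumption: age is measured in whole calendar months.
    months_old = (as_of.year - latest.year) * 12 + (as_of.month - latest.month)
    mixed = [c for c in columns
             if len({type(r[c]) for r in rows if r.get(c) is not None}) > 1]
    return {
        "duplicate_rows_over_10pct": duplicate_pct > 10.0,
        "columns_over_50pct_missing": sparse,
        "latest_update_over_6_months_old": months_old > 6,
        "columns_with_mixed_types": mixed,
    }


# Hypothetical sample records and assessment date, for illustration only.
rows = [
    {"company": "A", "month": date(2018, 11, 1), "count": 1.0},
    {"company": "A", "month": date(2018, 11, 1), "count": 1.0},
    {"company": "B", "month": date(2018, 12, 1), "count": None},
    {"company": "C", "month": date(2018, 12, 1), "count": "n/a"},
]
result = scorecard(rows, ["company", "month", "count"], "month", date(2019, 9, 1))
```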