Workforce Data

BY

ETF_Logo


Product Structure Tree

Number of Files Received: 4

1. Gender

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter Yes No Comments
Duplicate rows > 10%? ✓ Cannot calculate unique rows
Missing values > 50% for one or more columns? ✓ 0 columns have missing values > 50%
Most recent update is more than 6 months old? ✓ Latest Date: 2018-Dec
Data contain PII (Personally Identifiable Information)? ✓

1.1. Metadata Summary

1.1.1. Data Dimensions

Number of Columns: 10

Number of Rows: 733,282,691

Date Range: Jan-2008 to Dec-2018

1.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.1.3. Data Subset

Sample 6 rows of the dataset.

1.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
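
The completeness measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce this report: the function name `completeness_summary`, the sample rows, and the use of `None` as the missing-value marker are all assumptions.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def completeness_summary(rows: List[Row], columns: List[str]) -> Dict[str, Any]:
    """Summarise missing data; None is assumed to mark a missing value."""
    missing_per_column = {c: 0 for c in columns}
    complete_rows = 0
    for row in rows:
        row_missing = 0
        for c in columns:
            if row.get(c) is None:
                missing_per_column[c] += 1
                row_missing += 1
        if row_missing == 0:
            complete_rows += 1
    # A column is "all missing" when every row lacks a value for it.
    all_missing = [c for c in columns if rows and missing_per_column[c] == len(rows)]
    return {
        "all_missing_columns": all_missing,
        "complete_rows": complete_rows,
        "missing_observations": sum(missing_per_column.values()),
    }


# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"company": "A", "count": 3.0, "note": None},
    {"company": "B", "count": None, "note": None},
    {"company": "C", "count": 1.5, "note": None},
]
summary = completeness_summary(sample, ["company", "count", "note"])
print(summary)
```

In this toy sample the `note` column is entirely missing, so no row counts as complete.
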
1.1.5. Missing Data

The plot shows the percentage of missing values for each column, color-coded on a spectrum from Green (0%) to Red (100%).

1.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the corresponding frequency is plotted on the Y-axis.
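
As a rough illustration of the binning step, the sketch below assigns values to equal-width bins and counts them. The equal-width rule, the function name `histogram`, and the sample values are assumptions; the report does not state how its bins were chosen.

```python
from typing import List, Tuple


def histogram(values: List[float], n_bins: int) -> List[Tuple[float, float, int]]:
    """Return (bin_start, bin_end, frequency) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(n_bins)]


# Hypothetical values; X-axis = bin range, Y-axis = frequency.
bins = histogram([1, 2, 2, 3, 8, 9, 10], n_bins=3)
for start, end, freq in bins:
    print(f"[{start:.1f}, {end:.1f}): {freq}")
```
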

1.2.1.1 Column - count

Column count represents the expected number of employees in the position at a particular date

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

1.2.1.2 Column - inflow

Column inflow represents the expected number of employees moving into the specified positions in that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

1.2.1.3 Column - outflow

Column outflow represents the expected number of employees moving out of the specified positions in that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

1.2.2. Frequency Counts - Categorical Variable(s)
1.2.2.1. Column - company

Column company represents the name of the company

There are 4,716 unique company names

The table below shows the frequency counts of the top 10 and bottom 10 company names

Top 10 Company | Frequency | Bottom 10 Company | Frequency
Citigroup Inc | 1,773,765 | Shanghai Chlor-Alkali Chemical Co Ltd | 257
General Electric Company | 1,712,321 | Max Financial Services Ltd | 274
International Business Machines Corporation | 1,577,087 | Sino Biopharmaceutical Ltd | 504
Alphabet Inc | 1,455,052 | Enochian Biosciences Inc | 510
Oracle Corporation | 1,408,051 | Wrap Technologies Inc | 796
Hewlett Packard Enterprise Company | 1,407,815 | Biospecifics Technologies Corp | 1,024
Royal Dutch Shell PLC | 1,401,654 | City Office REIT Inc | 1,043
HSBC Holdings PLC | 1,397,208 | Marker Therapeutics Inc | 1,047
Facebook Inc | 1,382,828 | Riot Blockchain Inc | 1,053
Dell Technologies Inc | 1,355,712 | Daito Trust Construction Co Ltd | 1,092
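
A frequency table of this kind (unique-value count plus the most and least frequent values) can be produced along the following lines. This is only a sketch: `top_and_bottom` and the single-letter company labels are hypothetical, and the report's own tooling is not documented here.

```python
from collections import Counter
from typing import List, Tuple


def top_and_bottom(values: List[str], n: int = 10) -> Tuple[int, list, list]:
    """Return (number of unique values, n most frequent, n least frequent)."""
    counts = Counter(values)
    top = counts.most_common(n)  # descending by frequency
    bottom = sorted(counts.items(), key=lambda kv: kv[1])[:n]  # ascending
    return len(counts), top, bottom


# Hypothetical column values standing in for company names.
companies = ["A", "B", "A", "C", "A", "B"]
n_unique, top, bottom = top_and_bottom(companies, n=2)
print(n_unique, top, bottom)
```
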
1.2.2.2. Column - region

Column region represents the region name

1.2.2.3. Column - seniority

Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4

1.2.2.4. Column - role_k50

Column role_k50 represents the job roles where there are 50 unique job roles

There are 51 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K50 | Frequency | Bottom 10 Role K50 | Frequency
senior_manager | 88,624,734 | barista | 415,143
manager | 47,264,698 | pharmacist | 1,096,909
technician | 46,650,648 | realtor | 1,175,383
account_manager | 39,564,016 | software_engineer_|_internet | 1,231,679
project_manager | 34,188,909 | retail_sales_consultant | 1,233,677
product_manager | 33,749,692 | cashier | 1,308,013
director | 31,680,632 | clinical_research_associate | 1,389,100
financial_analyst | 30,762,309 | financial_advisor | 1,555,267
analyst_|_information_technology_services | 25,193,425 | geologist | 1,557,940
consultant | 23,436,005 | pilot | 2,248,606
1.2.2.5. Column - role_k150

Column role_k150 represents the job role when there are 150 unique job roles

There are 151 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K150 | Frequency | Bottom 10 Role K150 | Frequency
manager | 21,779,686 | package_handler | 183,271
sales | 17,903,789 | cashier_|_restaurants | 203,362
program_manager | 17,687,710 | tax_preparer | 284,319
marketing_manager | 15,490,913 | specialist_|_consumer_electronics | 309,011
team_leader | 14,526,238 | avon_representative | 323,740
hr_manager | 13,477,507 | rf_engineer | 330,006
coordinator | 12,807,910 | area_manager_|_internet | 370,923
customer_service_representative | 12,516,445 | barista | 415,143
senior_manager | 12,399,990 | assistant_director_admissions | 437,276
gerente | 12,232,697 | pharmacy_technician | 449,204
1.2.2.6. Column - gender

Column gender represents the estimated gender based on the first name of employees

1.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from January 2008 to December 2018

2. Long

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter Yes No Comments
Duplicate rows > 10%? ✓
Missing values > 50% for one or more columns? ✓ 0 columns have missing values > 50%
Most recent update is more than 6 months old? ✓ Latest Date: 2018-Dec
Data contain PII (Personally Identifiable Information)? ✓

2.1. Metadata Summary

2.1.1. Data Dimensions

Number of Columns: 11

Number of Rows: 366,618,005

Date Range: Dec-2008 to Dec-2018

2.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

2.1.3. Data Subset

Sample 6 rows of the dataset.

2.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

2.1.5. Missing Data

There are no missing records present in the data.

2.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)
2.2.1.1 Column - prestige

Column prestige represents the average prestige for employees in the position

2.2.1.2 Column - salary

Column salary represents the expected total salary for employees in the position represented

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 1000).

2.2.1.3 Column - count

Column count represents the expected number of employees at a position

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

2.2.1.5 Column - inflow

Column inflow represents the expected number of employees moving into the specified positions in that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

2.2.1.6 Column - outflow

Column outflow represents the expected number of employees moving out of the specified positions in that month

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).

2.2.2. Frequency Counts - Categorical Variable(s)
2.2.2.1. Column - company

Column company represents the name of the company

There are 4,716 unique company names

The table below shows the frequency counts of the top 10 and bottom 10 company names

Top 10 Company | Frequency | Bottom 10 Company | Frequency
Facebook Inc | 1,125,126 | Shanghai Chlor-Alkali Chemical Co Ltd | 104
MICROSOFT CORPORATION | 1,018,422 | Max Financial Services Ltd | 132
General Electric Company | 896,695 | Sino Biopharmaceutical Ltd | 217
Caterpillar Inc | 873,300 | Enochian Biosciences Inc | 260
GlaxoSmithKline PLC | 817,960 | Wrap Technologies Inc | 274
International Business Machines Corporation | 813,835 | Riot Blockchain Inc | 455
Apple, Inc | 806,387 | City Office REIT Inc | 456
Royal Dutch Shell PLC | 801,651 | Biospecifics Technologies Corp | 528
Exxon Mobil Corporation | 798,659 | Daito Trust Construction Co Ltd | 528
Citigroup Inc | 794,934 | Marker Therapeutics Inc | 611
2.2.2.2. Column - region

Column region represents the region name

2.2.2.3. Column - seniority

Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4

2.2.2.4. Column - role_k50

Column role_k50 represents the job roles where there are 50 unique job roles

There are 51 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K50 | Frequency | Bottom 10 Role K50 | Frequency
senior_manager | 44,283,472 | barista | 209,862
manager | 23,653,844 | pharmacist | 558,516
technician | 23,350,643 | realtor | 588,992
account_manager | 19,781,922 | retail_sales_consultant | 616,437
project_manager | 17,077,549 | software_engineer_|_internet | 618,526
product_manager | 16,851,306 | cashier | 656,132
director | 15,856,308 | clinical_research_associate | 701,160
financial_analyst | 15,389,108 | financial_advisor | 773,648
analyst_|_information_technology_services | 12,557,566 | geologist | 792,151
consultant | 11,746,686 | pilot | 1,129,199
2.2.2.5. Column - role_k150

Column role_k150 represents the job role when there are 150 unique job roles

There are 151 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K150 | Frequency | Bottom 10 Role K150 | Frequency
manager | 10,879,914 | package_handler | 92,160
sales | 8,941,708 | cashier_|_restaurants | 103,492
program_manager | 8,827,383 | tax_preparer | 141,885
marketing_manager | 7,738,659 | rf_engineer | 158,132
team_leader | 7,248,967 | specialist_|_consumer_electronics | 158,328
hr_manager | 6,727,866 | avon_representative | 160,467
coordinator | 6,402,989 | area_manager_|_internet | 186,436
customer_service_representative | 6,254,133 | barista | 209,862
senior_manager | 6,209,316 | assistant_director_admissions | 218,827
gerente | 6,115,574 | pharmacy_technician | 228,888

2.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from January 2008 to December 2018

3. Postings

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter Yes No Comments
Duplicate rows > 10%? ✓ 100% unique rows
Missing values > 50% for one or more columns? ✓ 0 columns have missing values > 50%
Most recent update is more than 6 months old? ✓ Latest Date: Dec-2018
Data contain PII (Personally Identifiable Information)? ✓

3.1. Metadata Summary

3.1.1. Data Dimensions

Number of Columns: 12

Number of Rows: 913,655

Date Range: April-2016 to December-2018

3.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

3.1.3. Data Subset

Sample 6 rows of the dataset.

3.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

3.1.5. Missing Data

3.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)
3.2.1.1 Column - new_postings

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 300).

3.2.1.2 Column - active_postings

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 300).

3.2.1.3 Column - removed_postings

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 150).

3.2.1.4 Column - new_salary_avg

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 0 and less than 200000).

3.2.1.5 Column - active_salary_avg

3.2.1.6 Column - removed_salary_avg

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 10000 and less than 1000).

3.2.2. Frequency Counts - Categorical Variable(s)
3.2.2.1. Column - company

Column company represents the name of the company

There are 1,026 unique company names

The table below shows the frequency counts of the top 10 and bottom 10 company names

Top 10 Company | Frequency | Bottom 10 Company | Frequency
Amedisys Inc | 35,635 | 5N Plus Inc | 1
J.Crew Group, Inc | 27,801 | Acceleron Pharma Inc | 1
AECOM | 27,628 | Axon Enterprise Inc | 1
Anthem, Inc | 25,168 | Bactolac Pharmaceutical Inc | 1
Ascena Retail Group, Inc | 24,421 | Bassett Furniture Industries Inc | 1
Lockheed Martin Corporation | 17,047 | Carbon Black Inc | 1
Tractor Supply Company | 16,570 | Cascade Microtech Inc | 1
United Rentals Inc | 14,667 | Chegg Inc | 1
Hot Topic Inc | 14,538 | CIT Group Inc | 1
Boston Scientific Corporation | 14,212 | Crane Company | 1
3.2.2.2. Column - region_state

Column region_state represents the region name

The table below shows the Top 10 and Bottom 10 regions

Top 10 Region State | Frequency | Bottom 10 Region State | Frequency
New York-Northern New Jersey-Long Island NY-NJ-PA MSA | 34,527 | MA NONMETROPOLITAN AREA | 7
Boston-Cambridge-Quincy MA-NH MSA | 28,657 | Cumberland MD-WV MSA | 22
Los Angeles-Long Beach-Santa Ana CA MSA | 27,768 | Danville IL MSA | 29
Chicago-Naperville-Joliet IL-IN-WI MSA | 24,628 | Monroe MI MSA | 39
San Francisco-Oakland-Fremont CA MSA | 23,636 | Lake Havasu City-Kingman AZ MSA | 52
Dallas-Fort Worth-Arlington TX MSA | 22,265 | Weirton-Steubenville WV-OH MSA | 53
Washington-Arlington-Alexandria DC-VA-MD-WV MSA | 19,037 | Bay City MI MSA | 54
Philadelphia-Camden-Wilmington PA-NJ-DE-MD MSA | 18,565 | Yuba City CA MSA | 55
Atlanta-Sandy Springs-Marietta GA MSA | 18,131 | Danville VA MSA | 56
San Jose-Sunnyvale-Santa Clara CA MSA | 17,006 | Sandusky OH MSA | 63
3.2.2.3. Column - seniority

Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4

3.2.2.4. Column - role_k50

Column role_k50 represents the job roles where there are 50 unique job roles

There are 51 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K50 | Frequency | Bottom 10 Role K50 | Frequency
technician | 82,522 | software_engineer_|_internet | 110
manager | 58,808 | geologist | 233
senior_manager | 53,484 | financial_advisor | 530
account_manager | 47,402 | barista | 710
rn | 39,992 | realtor | 725
project_manager | 35,365 | tax_associate | 1,033
customer_service_representative | 30,510 | pilot | 1,169
product_manager | 26,351 | retail_sales_consultant | 1,248
sales_associate | 25,583 | clinical_research_associate | 1,552
financial_analyst | 24,869 | cashier | 2,121
3.2.2.5. Column - role_k150

Column role_k150 represents the job role when there are 150 unique job roles

There are 151 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K150 | Frequency | Bottom 10 Role K150 | Frequency
sales | 35,598 | area_manager_|_internet | 4
customer_service_representative | 30,398 | specialist_|_consumer_electronics | 8
store_manager | 24,382 | avon_representative | 9
sales_associate | 21,430 | cashier_|_restaurants | 25
operator | 21,181 | independent_distributor | 35
program_manager | 19,578 | account_manager_|_transportation_trucking_railroad | 77
technician | 18,133 | sales_associate_|_retail | 90
concierge | 17,901 | terminal_manager | 99
assistant_manager | 16,968 | rf_engineer | 110
construction_manager | 16,036 | software_engineer_|_internet | 110

3.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from April 2016 to December 2018

4. Transitions

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter Yes No Comments
Duplicate rows > 10%? ✓ 100% unique rows
Missing values > 50% for one or more columns? ✓ 0 columns have missing values > 50%
Most recent update is more than 6 months old? ✓ Latest Date: Dec-2018
Data contain PII (Personally Identifiable Information)? ✓

4.1. Metadata Summary

4.1.1. Data Dimensions

Number of Columns: 8

Number of Rows: 24,004

Date Range: December-2007 to December-2018

4.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

4.1.3. Data Subset

Sample 6 rows of the dataset.

4.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

4.1.5. Missing Data

4.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)
4.2.1.1 Column - outflow

Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 10).

4.2.2. Frequency Counts - Categorical Variable(s)
4.2.2.1. Column - company_a

Column company_a represents the name of the company

Company | Frequency
MICROSOFT CORPORATION | 7,137
International Business Machines Corporation | 6,347
Oracle Corporation | 4,828
Amazon.com, Inc | 3,225
Alphabet Inc | 2,467
4.2.2.2. Column - company_b

Column company_b represents the name of the company

Company | Frequency
MICROSOFT CORPORATION | 6,612
Amazon.com, Inc | 5,314
Alphabet Inc | 4,482
Oracle Corporation | 4,225
International Business Machines Corporation | 3,371
4.2.2.3. Column - region_a

Column region_a represents the region name

The graph below shows Top 10 regions

4.2.2.4. Column - region_b

Column region_b represents the region name

The graph below shows Top 10 regions

4.2.2.5. Column - role_k50_a

Column role_k50_a represents the job roles where there are 50 unique job roles

There are 48 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K50 | Frequency | Bottom 10 Role K50 | Frequency
software_engineer | 3,740 | clinical_research_associate | 1
account_manager_|_information_technology_services | 2,969 | financial_advisor | 1
analyst_|_information_technology_services | 2,221 | pilot | 1
account_manager | 1,703 | realtor | 1
product_owner | 1,649 | cashier | 2
senior_manager | 1,539 | server | 2
project_manager_|_information_technology_services | 1,344 | manager_|_restaurants | 3
product_manager | 1,133 | tax_associate | 3
software_engineer_|_internet | 1,028 | mortgage_loan_officer | 5
network_engineer | 860 | agent | 6
4.2.2.6. Column - role_k50_b

Column role_k50_b represents the job roles where there are 50 unique job roles

There are 48 unique values in this column

The table below shows the frequency counts of the top 10 and bottom 10 roles

Top 10 Role K50 | Frequency | Bottom 10 Role K50 | Frequency
software_engineer | 3,062 | cashier | 1
account_manager_|_information_technology_services | 3,060 | clinical_research_associate | 1
software_engineer_|_internet | 2,231 | geologist | 1
product_owner | 1,940 | manager_|_restaurants | 2
account_manager | 1,939 | pharmacist | 2
analyst_|_information_technology_services | 1,716 | pilot | 2
project_manager_|_information_technology_services | 1,286 | mortgage_loan_officer | 4
senior_manager | 1,269 | tax_associate | 5
product_manager | 1,160 | realtor | 6
empty | 1,002 | retail_sales_consultant | 6

4.3. Time Series Analysis

Periodicity: Monthly

The graph below shows the trend of positions from December 2007 to December 2018

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
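
The kernel smoothing behind a density plot can be illustrated with a Gaussian kernel density estimate. The sketch below is an assumption-laden example: the Gaussian kernel, the fixed bandwidth of 0.5, the function name `kde_gaussian`, and the sample values are chosen for illustration and are not taken from this report.

```python
import math
from typing import List


def kde_gaussian(sample: List[float], grid: List[float], bandwidth: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in grid
    ]


# Hypothetical sample with two clusters, around 1 and around 5.
sample = [1.0, 1.2, 0.9, 5.0, 5.1]
grid = [i * 0.1 for i in range(-30, 91)]  # evaluation points from -3.0 to 9.0
density = kde_gaussian(sample, grid, bandwidth=0.5)

# The estimated density is higher near a cluster than in the gap between clusters.
print(density[40] > density[60])
```

A smaller bandwidth follows the data more closely and shows more noise; a larger one gives a smoother curve with less detail.
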

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
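
The pairs of quantiles that make up a QQ plot can be computed as follows. This is a sketch under stated assumptions: the reference distribution is a normal fitted to the sample, the plotting positions are (i + 0.5) / n, and `qq_points` and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Tuple


def qq_points(sample: List[float]) -> List[Tuple[float, float]]:
    """Pair theoretical normal quantiles with the sorted sample values."""
    n = len(sample)
    ordered = sorted(sample)
    reference = NormalDist(mean(sample), stdev(sample))
    # Positions (i + 0.5) / n keep every probability strictly inside (0, 1).
    return [(reference.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


# Hypothetical sample values.
points = qq_points([2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8])
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:5.2f}")
```

Points lying close to the line y = x suggest the sample is approximately normal; systematic curvature at the ends suggests skew or heavy tails.
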

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
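
A correlation matrix of this kind can be built from pairwise Pearson coefficients. In the sketch below, the choice of Pearson correlation, the helper names, and the column values are illustrative assumptions; the column names merely echo those used in this report.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; undefined for constant columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values, not taken from the dataset.
data = {
    "count": [10, 20, 30, 40],
    "inflow": [1, 2, 3, 4],
    "outflow": [4, 3, 2, 1],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```
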

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
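
The series behind a monthly trend chart is typically an aggregate per calendar month. The sketch below sums a value by month; summation as the aggregate, the function name `monthly_totals`, and the records are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple


def monthly_totals(records: List[Tuple[date, float]]) -> Dict[str, float]:
    """Sum values per calendar month, keyed as YYYY-MM in chronological order."""
    totals: Dict[str, float] = defaultdict(float)
    for day, value in records:
        totals[f"{day.year:04d}-{day.month:02d}"] += value
    return dict(sorted(totals.items()))


# Hypothetical (date, count) records.
records = [
    (date(2018, 11, 1), 5.0),
    (date(2018, 12, 1), 7.0),
    (date(2018, 11, 1), 2.5),
]
trend = monthly_totals(records)
print(trend)
```
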

Appendix 2: Data Quality Scorecard

Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability - Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications, when and where they need it for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training and for testing/validating the models.

2. Completeness - All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.

3. Uniqueness - Data redundancy should be avoided

Uniqueness requires that no duplicate records are reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.

4. Recency - Data is recent and not outdated

Recency is the degree to which information is current as of the present period. It measures how up-to-date the information is, and whether it is still correct despite possible time-related changes.

5. Consistency - Uniform data types across the column

It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).
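
The completeness, uniqueness, recency and consistency checks can be sketched together, using the thresholds shown in the scorecards (duplicates > 10%, missing values > 50%, latest update older than 6 months). This is a hedged illustration, not the framework's implementation: all function names, the `as_of` reference date, the sample rows, and the use of `None` for missing values are assumptions. Availability is not covered because no threshold for it is given.

```python
from collections import Counter
from datetime import date
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def pct_missing(rows: List[Row], column: str) -> float:
    """Completeness: percentage of rows where `column` is None."""
    return 100.0 * sum(1 for r in rows if r.get(column) is None) / len(rows)


def pct_duplicates(rows: List[Row], columns: List[str]) -> float:
    """Uniqueness: percentage of rows that exactly repeat an earlier row."""
    keys = [tuple(r.get(c) for c in columns) for r in rows]
    return 100.0 * (len(keys) - len(set(keys))) / len(keys)


def months_old(latest: date, as_of: date) -> int:
    """Recency: calendar months between the latest record and `as_of`."""
    return (as_of.year - latest.year) * 12 + (as_of.month - latest.month)


def type_counts(rows: List[Row], column: str) -> Counter:
    """Consistency: count of non-missing values per Python type in a column."""
    return Counter(type(r[column]).__name__ for r in rows if r.get(column) is not None)


def scorecard(rows: List[Row], columns: List[str], date_column: str, as_of: date) -> Dict[str, Any]:
    latest = max(r[date_column] for r in rows if r.get(date_column) is not None)
    return {
        "duplicate_rows_over_10pct": pct_duplicates(rows, columns) > 10.0,
        "columns_missing_over_50pct": [c for c in columns if pct_missing(rows, c) > 50.0],
        "stale_over_6_months": months_old(latest, as_of) > 6,
        "mixed_type_columns": [c for c in columns if len(type_counts(rows, c)) > 1],
    }


# Hypothetical rows and reference date, not taken from the dataset.
rows = [
    {"company": "A", "month": date(2018, 11, 1), "count": 3.0},
    {"company": "A", "month": date(2018, 11, 1), "count": 3.0},  # exact duplicate
    {"company": "B", "month": date(2018, 12, 1), "count": None},  # missing value
    {"company": "C", "month": date(2018, 12, 1), "count": "7"},  # string in a numeric column
]
result = scorecard(rows, ["company", "month", "count"], "month", as_of=date(2019, 3, 1))
print(result)
```
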