Number of Files Received: 4
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | Cannot calculate unique rows | |
| Missing values > 50% for one or more columns? | ✓ | 0 columns have missing values > 50% | |
| Most recent update is more than 6 months old? | ✓ | Latest Date: 2018-Dec | |
| Data contain PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 10
Number of Rows: 733,282,691
Date Range: Jan-2008 to Dec-2018
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The plot shows the percentage of missing values for each column, color-coded on a spectrum from Green (0%) to Red (100%).
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the corresponding frequency is plotted on the Y-axis.
Column count represents the expected number of employees in the position at a particular date
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).
Column inflow represents the expected number of employees moving into positions specified at that month
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).
Column outflow represents the expected number of employees moving out of positions specified at that month
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).
Column company represents the name of the company
There are 4,716 unique company names
The table below shows the frequency counts of the top 10 and bottom 10 company names
| Company (top 10) | Frequency | Company (bottom 10) | Frequency |
|---|---|---|---|
| Citigroup Inc | 1,773,765 | Shanghai Chlor-Alkali Chemical Co Ltd | 257 |
| General Electric Company | 1,712,321 | Max Financial Services Ltd | 274 |
| International Business Machines Corporation | 1,577,087 | Sino Biopharmaceutical Ltd | 504 |
| Alphabet Inc | 1,455,052 | Enochian Biosciences Inc | 510 |
| Oracle Corporation | 1,408,051 | Wrap Technologies Inc | 796 |
| Hewlett Packard Enterprise Company | 1,407,815 | Biospecifics Technologies Corp | 1,024 |
| Royal Dutch Shell PLC | 1,401,654 | City Office REIT Inc | 1,043 |
| HSBC Holdings PLC | 1,397,208 | Marker Therapeutics Inc | 1,047 |
| Facebook Inc | 1,382,828 | Riot Blockchain Inc | 1,053 |
| Dell Technologies Inc | 1,355,712 | Daito Trust Construction Co Ltd | 1,092 |
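A top 10 / bottom 10 frequency table of this kind can be sketched as follows. This is a minimal in-memory illustration, not the tooling used to produce the report; the function name `top_and_bottom` and the toy values are hypothetical, and a dataset of this size would more likely be aggregated in a database or distributed engine.

```python
from collections import Counter

def top_and_bottom(values, n=10):
    """Return the n most frequent and n least frequent values with their counts."""
    counts = Counter(values)
    ranked = counts.most_common()                      # descending frequency
    top = ranked[:n]
    bottom = sorted(ranked, key=lambda kv: kv[1])[:n]  # ascending frequency
    return top, bottom

# Hypothetical toy data, not taken from the dataset.
sample = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
top, bottom = top_and_bottom(sample, n=2)
print(top)     # most frequent values first
print(bottom)  # least frequent values first
```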
Column region represents the region name
Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4
Column role_K50 represents the job role when there are 50 unique job roles
There are 51 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K50 (top 10) | Frequency | Role K50 (bottom 10) | Frequency |
|---|---|---|---|
| senior_manager | 88,624,734 | barista | 415,143 |
| manager | 47,264,698 | pharmacist | 1,096,909 |
| technician | 46,650,648 | realtor | 1,175,383 |
| account_manager | 39,564,016 | software_engineer_\|_internet | 1,231,679 |
| project_manager | 34,188,909 | retail_sales_consultant | 1,233,677 |
| product_manager | 33,749,692 | cashier | 1,308,013 |
| director | 31,680,632 | clinical_research_associate | 1,389,100 |
| financial_analyst | 30,762,309 | financial_advisor | 1,555,267 |
| analyst_\|_information_technology_services | 25,193,425 | geologist | 1,557,940 |
| consultant | 23,436,005 | pilot | 2,248,606 |
Column role_k150 represents the job role when there are 150 unique job roles
There are 151 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K150 (top 10) | Frequency | Role K150 (bottom 10) | Frequency |
|---|---|---|---|
| manager | 21,779,686 | package_handler | 183,271 |
| sales | 17,903,789 | cashier_\|_restaurants | 203,362 |
| program_manager | 17,687,710 | tax_preparer | 284,319 |
| marketing_manager | 15,490,913 | specialist_\|_consumer_electronics | 309,011 |
| team_leader | 14,526,238 | avon_representative | 323,740 |
| hr_manager | 13,477,507 | rf_engineer | 330,006 |
| coordinator | 12,807,910 | area_manager_\|_internet | 370,923 |
| customer_service_representative | 12,516,445 | barista | 415,143 |
| senior_manager | 12,399,990 | assistant_director_admissions | 437,276 |
| gerente | 12,232,697 | pharmacy_technician | 449,204 |
Column gender represents the estimated gender based on the first name of employees
Periodicity: Monthly
The graph below shows the trend of positions from January 2008 to December 2018
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | | |
| Missing values > 50% for one or more columns? | ✓ | 0 columns have missing values > 50% | |
| Most recent update is more than 6 months old? | ✓ | Latest Date: 2018-Dec | |
| Data contain PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 11
Number of Rows: 366,618,005
Date Range: Dec-2008 to Dec-2018
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
There are no missing records present in the data.
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
Column prestige represents the average prestige for employees in the position
##### 2.2.1.2 Column - salary
Column salary represents the expected total salary for employees in the position represented
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 1000).
Column count represents the expected number of employees at a position
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).
Column inflow represents the expected number of employees moving into positions specified at that month
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).
Column outflow represents the expected number of employees moving out of positions specified at that month
Note: Figure 2 is the zoomed version of figure 1 (containing values greater than 60 and less than 500).
Column company represents the name of the company
There are 4,716 unique company names
The table below shows the frequency counts of the top 10 and bottom 10 company names
| Company (top 10) | Frequency | Company (bottom 10) | Frequency |
|---|---|---|---|
| Facebook Inc | 1,125,126 | Shanghai Chlor-Alkali Chemical Co Ltd | 104 |
| MICROSOFT CORPORATION | 1,018,422 | Max Financial Services Ltd | 132 |
| General Electric Company | 896,695 | Sino Biopharmaceutical Ltd | 217 |
| Caterpillar Inc | 873,300 | Enochian Biosciences Inc | 260 |
| GlaxoSmithKline PLC | 817,960 | Wrap Technologies Inc | 274 |
| International Business Machines Corporation | 813,835 | Riot Blockchain Inc | 455 |
| Apple, Inc | 806,387 | City Office REIT Inc | 456 |
| Royal Dutch Shell PLC | 801,651 | Biospecifics Technologies Corp | 528 |
| Exxon Mobil Corporation | 798,659 | Daito Trust Construction Co Ltd | 528 |
| Citigroup Inc | 794,934 | Marker Therapeutics Inc | 611 |
Column region represents the region name
Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4
Column role_K50 represents the job role when there are 50 unique job roles
There are 51 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K50 (top 10) | Frequency | Role K50 (bottom 10) | Frequency |
|---|---|---|---|
| senior_manager | 44,283,472 | barista | 209,862 |
| manager | 23,653,844 | pharmacist | 558,516 |
| technician | 23,350,643 | realtor | 588,992 |
| account_manager | 19,781,922 | retail_sales_consultant | 616,437 |
| project_manager | 17,077,549 | software_engineer_\|_internet | 618,526 |
| product_manager | 16,851,306 | cashier | 656,132 |
| director | 15,856,308 | clinical_research_associate | 701,160 |
| financial_analyst | 15,389,108 | financial_advisor | 773,648 |
| analyst_\|_information_technology_services | 12,557,566 | geologist | 792,151 |
| consultant | 11,746,686 | pilot | 1,129,199 |
Column role_k150 represents the job role when there are 150 unique job roles
There are 151 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K150 (top 10) | Frequency | Role K150 (bottom 10) | Frequency |
|---|---|---|---|
| manager | 10,879,914 | package_handler | 92,160 |
| sales | 8,941,708 | cashier_\|_restaurants | 103,492 |
| program_manager | 8,827,383 | tax_preparer | 141,885 |
| marketing_manager | 7,738,659 | rf_engineer | 158,132 |
| team_leader | 7,248,967 | specialist_\|_consumer_electronics | 158,328 |
| hr_manager | 6,727,866 | avon_representative | 160,467 |
| coordinator | 6,402,989 | area_manager_\|_internet | 186,436 |
| customer_service_representative | 6,254,133 | barista | 209,862 |
| senior_manager | 6,209,316 | assistant_director_admissions | 218,827 |
| gerente | 6,115,574 | pharmacy_technician | 228,888 |
Periodicity: Monthly
The graph below shows the trend of positions from January 2008 to December 2018
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | 100% unique rows | |
| Missing values > 50% for one or more columns? | ✓ | 0 columns have missing values > 50% | |
| Most recent update is more than 6 months old? | ✓ | Latest Date: Dec-2018 | |
| Data contain PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 12
Number of Rows: 913,655
Date Range: April-2016 to December-2018
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
Column company represents the name of the company
There are 1,026 unique company names
The table below shows the frequency counts of the top 10 and bottom 10 company names
| Company (top 10) | Frequency | Company (bottom 10) | Frequency |
|---|---|---|---|
| Amedisys Inc | 35,635 | 5N Plus Inc | 1 |
| J.Crew Group, Inc | 27,801 | Acceleron Pharma Inc | 1 |
| AECOM | 27,628 | Axon Enterprise Inc | 1 |
| Anthem, Inc | 25,168 | Bactolac Pharmaceutical Inc | 1 |
| Ascena Retail Group, Inc | 24,421 | Bassett Furniture Industries Inc | 1 |
| Lockheed Martin Corporation | 17,047 | Carbon Black Inc | 1 |
| Tractor Supply Company | 16,570 | Cascade Microtech Inc | 1 |
| United Rentals Inc | 14,667 | Chegg Inc | 1 |
| Hot Topic Inc | 14,538 | CIT Group Inc | 1 |
| Boston Scientific Corporation | 14,212 | Crane Company | 1 |
Column region_state represents the region name
The table below shows the Top 10 and Bottom 10 regions
| Region State (top 10) | Frequency | Region State (bottom 10) | Frequency |
|---|---|---|---|
| New York-Northern New Jersey-Long Island NY-NJ-PA MSA | 34,527 | MA NONMETROPOLITAN AREA | 7 |
| Boston-Cambridge-Quincy MA-NH MSA | 28,657 | Cumberland MD-WV MSA | 22 |
| Los Angeles-Long Beach-Santa Ana CA MSA | 27,768 | Danville IL MSA | 29 |
| Chicago-Naperville-Joliet IL-IN-WI MSA | 24,628 | Monroe MI MSA | 39 |
| San Francisco-Oakland-Fremont CA MSA | 23,636 | Lake Havasu City-Kingman AZ MSA | 52 |
| Dallas-Fort Worth-Arlington TX MSA | 22,265 | Weirton-Steubenville WV-OH MSA | 53 |
| Washington-Arlington-Alexandria DC-VA-MD-WV MSA | 19,037 | Bay City MI MSA | 54 |
| Philadelphia-Camden-Wilmington PA-NJ-DE-MD MSA | 18,565 | Yuba City CA MSA | 55 |
| Atlanta-Sandy Springs-Marietta GA MSA | 18,131 | Danville VA MSA | 56 |
| San Jose-Sunnyvale-Santa Clara CA MSA | 17,006 | Sandusky OH MSA | 63 |
Column seniority represents the level of seniority of a position where there are 4 possible values: 1,2,3,4
Column role_K50 represents the job role when there are 50 unique job roles
There are 51 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K50 (top 10) | Frequency | Role K50 (bottom 10) | Frequency |
|---|---|---|---|
| technician | 82,522 | software_engineer_\|_internet | 110 |
| manager | 58,808 | geologist | 233 |
| senior_manager | 53,484 | financial_advisor | 530 |
| account_manager | 47,402 | barista | 710 |
| rn | 39,992 | realtor | 725 |
| project_manager | 35,365 | tax_associate | 1,033 |
| customer_service_representative | 30,510 | pilot | 1,169 |
| product_manager | 26,351 | retail_sales_consultant | 1,248 |
| sales_associate | 25,583 | clinical_research_associate | 1,552 |
| financial_analyst | 24,869 | cashier | 2,121 |
Column role_k150 represents the job role when there are 150 unique job roles
There are 151 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K150 (top 10) | Frequency | Role K150 (bottom 10) | Frequency |
|---|---|---|---|
| sales | 35,598 | area_manager_\|_internet | 4 |
| customer_service_representative | 30,398 | specialist_\|_consumer_electronics | 8 |
| store_manager | 24,382 | avon_representative | 9 |
| sales_associate | 21,430 | cashier_\|_restaurants | 25 |
| operator | 21,181 | independent_distributor | 35 |
| program_manager | 19,578 | account_manager_\|_transportation_trucking_railroad | 77 |
| technician | 18,133 | sales_associate_\|_retail | 90 |
| concierge | 17,901 | terminal_manager | 99 |
| assistant_manager | 16,968 | rf_engineer | 110 |
| construction_manager | 16,036 | software_engineer_\|_internet | 110 |
Periodicity: Monthly
The graph below shows the trend of positions from April 2016 to December 2018
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | 100% unique rows | |
| Missing values > 50% for one or more columns? | ✓ | 0 columns have missing values > 50% | |
| Most recent update is more than 6 months old? | ✓ | Latest Date: Dec-2018 | |
| Data contain PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 8
Number of Rows: 24,004
Date Range: December-2007 to December-2018
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
Column company represents the name of the company
| Company | Frequency |
|---|---|
| MICROSOFT CORPORATION | 7,137 |
| International Business Machines Corporation | 6,347 |
| Oracle Corporation | 4,828 |
| Amazon.com, Inc | 3,225 |
| Alphabet Inc | 2,467 |
Column company represents the name of the company
| Company | Frequency |
|---|---|
| MICROSOFT CORPORATION | 6,612 |
| Amazon.com, Inc | 5,314 |
| Alphabet Inc | 4,482 |
| Oracle Corporation | 4,225 |
| International Business Machines Corporation | 3,371 |
Column region_a represents the region name
The graph below shows Top 10 regions
Column region_b represents the region name
The graph below shows Top 10 regions
Column role_K50_a represents the job role when there are 50 unique job roles
There are 48 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K50 (top 10) | Frequency | Role K50 (bottom 10) | Frequency |
|---|---|---|---|
| software_engineer | 3,740 | clinical_research_associate | 1 |
| account_manager_\|_information_technology_services | 2,969 | financial_advisor | 1 |
| analyst_\|_information_technology_services | 2,221 | pilot | 1 |
| account_manager | 1,703 | realtor | 1 |
| product_owner | 1,649 | cashier | 2 |
| senior_manager | 1,539 | server | 2 |
| project_manager_\|_information_technology_services | 1,344 | manager_\|_restaurants | 3 |
| product_manager | 1,133 | tax_associate | 3 |
| software_engineer_\|_internet | 1,028 | mortgage_loan_officer | 5 |
| network_engineer | 860 | agent | 6 |
Column role_K50_b represents the job role when there are 50 unique job roles
There are 48 unique role names
The table below shows the frequency counts of the top 10 and bottom 10 roles
| Role K50 (top 10) | Frequency | Role K50 (bottom 10) | Frequency |
|---|---|---|---|
| software_engineer | 3,062 | cashier | 1 |
| account_manager_\|_information_technology_services | 3,060 | clinical_research_associate | 1 |
| software_engineer_\|_internet | 2,231 | geologist | 1 |
| product_owner | 1,940 | manager_\|_restaurants | 2 |
| account_manager | 1,939 | pharmacist | 2 |
| analyst_\|_information_technology_services | 1,716 | pilot | 2 |
| project_manager_\|_information_technology_services | 1,286 | mortgage_loan_officer | 4 |
| senior_manager | 1,269 | tax_associate | 5 |
| product_manager | 1,160 | realtor | 6 |
| empty | 1,002 | retail_sales_consultant | 6 |
Periodicity: Monthly
The graph below shows the trend of positions from December 2007 to December 2018
Univariate Analysis involves the analysis of one variable at a time.
A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.
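The binning behind a histogram can be sketched as below. This is an illustrative implementation with equal-width bins, not the plotting code used for the report; the function name `histogram` and the sample values are hypothetical. The optional `lo` and `hi` bounds show how a zoomed view restricted to a value range could be produced.

```python
def histogram(values, n_bins, lo=None, hi=None):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # The upper bound is placed in the last bin instead of overflowing.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical values, not taken from the dataset.
edges, counts = histogram([1, 2, 2, 3, 7, 8, 9, 10], n_bins=3)
print(edges, counts)
```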
A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
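Kernel smoothing can be illustrated with a small Gaussian kernel density estimate. This is a sketch under stated assumptions: the Gaussian kernel, the fixed `bandwidth`, and the sample values are all choices made for this example, and the report does not state which kernel or bandwidth its plots use.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x by Gaussian kernel smoothing."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes a small bell curve centred on its own value.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical values, not taken from the dataset.
density = gaussian_kde([1.0, 2.0, 2.5, 3.0, 7.0], bandwidth=0.5)
grid = [i * 0.1 for i in range(-20, 121)]  # evaluation points from -2.0 to 12.0
curve = [density(x) for x in grid]
```

A larger bandwidth gives a smoother curve with flatter peaks; a smaller one follows the individual samples more closely.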
A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
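The points of a normal QQ plot can be computed as in the sketch below. The reference distribution (a normal fitted to the sample mean and standard deviation) and the plotting positions `(i + 0.5) / n` are assumptions of this example; other conventions exist, and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each sorted sample value with the matching theoretical normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical values, not taken from the dataset.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]
points = qq_points(sample)
```

If the sample is close to normal, the points lie near the line y = x; systematic curvature suggests skew or heavy tails.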
Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
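A correlation matrix can be sketched with the Pearson coefficient as below. The choice of Pearson is an assumption, since the report does not name the coefficient it uses. The column names `count`, `inflow`, and `outflow` appear in the datasets above, but the values here are hypothetical toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant

def correlation_matrix(columns):
    """Map every pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy values, not taken from the dataset.
data = {"count": [10, 20, 30, 40], "inflow": [1, 2, 3, 4], "outflow": [4, 3, 2, 1]}
matrix = correlation_matrix(data)
```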
Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
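The series behind a monthly trend chart can be built by summing a value per calendar month, as in the sketch below. The function name `monthly_totals` and the records are hypothetical; the report does not describe how its trend series were aggregated.

```python
from collections import defaultdict
from datetime import date

def monthly_totals(records):
    """Sum a numeric value per calendar month; records are (date, value) pairs."""
    totals = defaultdict(int)
    for day, value in records:
        totals[(day.year, day.month)] += value
    return dict(sorted(totals.items()))  # chronological order

# Hypothetical records, not taken from the dataset.
records = [(date(2018, 11, 1), 5), (date(2018, 12, 1), 7), (date(2018, 11, 1), 3)]
trend = monthly_totals(records)
```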
Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.
A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.
This dimension ensures that enough data is available to end users and applications, when and where they need it, for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training, testing, and validating models.
This dimension measures the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.
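The per-column missing-value percentage, and the "missing values > 50%" check used in the summary tables, can be sketched as follows. Treating `None` as the missing marker is an assumption of this example, and the rows are hypothetical.

```python
def missing_percentages(rows, columns):
    """Percentage of missing (None) values for each column."""
    total = len(rows)
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }

# Hypothetical rows, not taken from the dataset.
sample_rows = [
    {"company": "A", "region": None},
    {"company": "B", "region": "X"},
    {"company": None, "region": None},
    {"company": "C", "region": None},
]
missing_pct = missing_percentages(sample_rows, ["company", "region"])
flagged = [col for col, pct in missing_pct.items() if pct > 50.0]
```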
This dimension requires that the data contain no duplicate records. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.
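A duplicate-row percentage, compared against the 10% threshold from the summary tables, can be sketched as below. Counting every exact repeat of an earlier row as a duplicate is an assumption of this example, and the rows are hypothetical.

```python
def duplicate_percentage(rows):
    """Percentage of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / len(rows)

# Hypothetical rows, not taken from the dataset.
dup_rows = [("A", 1), ("B", 2), ("A", 1), ("C", 3), ("A", 1)]
dup_pct = duplicate_percentage(dup_rows)
exceeds_threshold = dup_pct > 10.0
```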
This dimension is the degree to which information is recent relative to the current period. It measures how up-to-date the information is, and whether it is correct despite possible time-related changes.
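The six-month recency check from the summary tables can be sketched as below. The latest record date of December 2018 comes from the report; the assessment dates passed as `as_of` are hypothetical, since the report does not state when the assessment was run.

```python
from datetime import date

def months_between(earlier, later):
    """Whole calendar months from earlier to later, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def is_stale(latest_record, as_of, max_age_months=6):
    """True if the most recent record is more than max_age_months old."""
    return months_between(latest_record, as_of) > max_age_months

latest = date(2018, 12, 1)                        # "Latest Date: 2018-Dec"
stale_in_march = is_stale(latest, date(2019, 3, 1))      # hypothetical assessment date
stale_in_september = is_stale(latest, date(2019, 9, 1))  # hypothetical assessment date
```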
It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).
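One narrow way to check column-level consistency is to verify that all values in a column share a single type, as sketched below. This is only one possible interpretation of the definition above; the function names are hypothetical, and the `seniority` values are invented to show a stray string among integers.

```python
from collections import Counter

def type_profile(values):
    """Count how many values of each Python type a column holds."""
    return Counter(type(v).__name__ for v in values)

def is_type_consistent(values):
    """True if every value in the column has the same type."""
    return len(type_profile(values)) <= 1

# Hypothetical column with one stray string, not taken from the dataset.
seniority = [1, 2, 3, 4, "3", 2]
profile = type_profile(seniority)
consistent = is_type_consistent(seniority)
```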