Number of Files Received: 5
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | ✓ | 92.82% Unique Rows |
| Missing values > 50% for one or more columns? | ✓ | | 1 column has missing values > 50% |
| Most recent update is more than 6 months old? | ✓ | | Latest Date: 30-Mar-2019 |
| Data contains PII (Personally Identifiable Information)? | ✓ | | |
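The first three checks in the table can be sketched in plain Python. This is only an illustration under stated assumptions, not the tooling that produced the report: the helper name `quality_flags`, the list-of-dicts row layout, the `date_key` parameter, the use of `None` for missing values and the 183-day approximation of six months are all hypothetical.

```python
from datetime import date, timedelta

def quality_flags(rows, date_key, today, max_dup=0.10, max_missing=0.50):
    """Compute summary flags for a list of dict rows (hypothetical helper).

    Missing values are assumed to be represented as None.
    """
    n = len(rows)
    columns = list(rows[0].keys())

    # Share of rows that are distinct when compared across all columns.
    unique_share = len({tuple(r[c] for c in columns) for r in rows}) / n

    # Columns whose share of missing values exceeds the threshold.
    sparse = [c for c in columns
              if sum(r[c] is None for r in rows) / n > max_missing]

    # Latest date in the data; flagged as stale if older than ~6 months.
    latest = max(r[date_key] for r in rows if r[date_key] is not None)
    stale = today - latest > timedelta(days=183)

    return {
        "unique_share": unique_share,
        "duplicates_over_limit": (1 - unique_share) > max_dup,
        "sparse_columns": sparse,
        "latest_date": latest,
        "stale": stale,
    }

# Made-up rows for illustration only.
rows = [
    {"posted": date(2019, 3, 30), "company": "Aramark", "salary": None},
    {"posted": date(2019, 3, 30), "company": "Aramark", "salary": None},
    {"posted": date(2019, 3, 29), "company": "Express", "salary": None},
    {"posted": date(2019, 3, 28), "company": "TARGET", "salary": 30000},
]
flags = quality_flags(rows, "posted", today=date(2019, 4, 15))
```

The reference date is passed in as `today` rather than read from the system clock, so the staleness flag is reproducible for a given report date.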
Number of Columns: 21
Number of Rows: 9,948,150
Date Range: 25-Apr-2016 to 30-Mar-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The plot shows the percentage of missing values for each column, color-coded on a spectrum from green (0%) to red (100%).
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the corresponding frequencies are plotted on the Y-axis.
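The binning step can be sketched as follows. This is a minimal illustration using equal-width bins; the function name `histogram`, the bin count and the sample values are made up and do not come from the dataset.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

values = [20, 25, 30, 30, 35, 50, 60, 60, 80, 100]  # made-up values
edges, counts = histogram(values, bins=4)
```

Here `edges` gives the X-axis bin boundaries and `counts` gives the Y-axis frequencies.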
The salary column represents the estimated pay level.
There are 6,193 unique company names.
The table below shows the frequency counts of the top 10 and bottom 10 company names.
| Company Name (top 10) | Frequency | Company Name (bottom 10) | Frequency |
|---|---|---|---|
| Care.Com | 670,514 | 3M Company | 1 |
| Aramark | 152,257 | 3M Health Information Systems | 1 |
| Wells+Fargo | 132,242 | A f l a c | 1 |
| Marriott International Inc | 111,047 | Acadia Healhcare | 1 |
| Wells Fargo | 110,970 | ACADIA HEALTHCARE | 1 |
| Express | 106,088 | Adams Resources & Energy Inc | 1 |
| Southern | 105,905 | adidas AG | 1 |
| Care.com | 87,558 | Advanced Emission Control Solutions | 1 |
| TARGET | 82,568 | Aflac - Capital Region | 1 |
| Lockheed+Martin | 79,254 | Aflac Careers | 1 |
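A top-10 / bottom-10 frequency table of this kind can be sketched with `collections.Counter`. The helper name `top_and_bottom` and the tie-breaking rule (alphabetical, ignoring case) are assumptions. Values are counted exactly as they appear, which is why spellings such as "Care.Com" and "Care.com" would be counted separately.

```python
from collections import Counter

def top_and_bottom(values, k=10):
    """Return the k most frequent and k least frequent values with counts."""
    counts = Counter(values)
    # Most frequent first; ties are broken alphabetically, ignoring case.
    top = sorted(counts.items(),
                 key=lambda kv: (-kv[1], kv[0].casefold()))[:k]
    # Least frequent first; same tie-breaking rule.
    bottom = sorted(counts.items(),
                    key=lambda kv: (kv[1], kv[0].casefold()))[:k]
    return top, bottom

# Made-up sample; the counts are not taken from the dataset.
names = ["Aramark"] * 3 + ["Express"] * 2 + ["adidas AG", "3M Company"]
top, bottom = top_and_bottom(names, k=2)
```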
There are 54,050 unique locations.
The table below shows the frequency counts of the top 10 and bottom 10 locations.
| Location (top 10) | Frequency | Location (bottom 10) | Frequency |
|---|---|---|---|
| New York NY | 77,567 | 65th Infantry PR 00923 | 1 |
| Atlanta GA | 66,526 | Abbeville LA 70511 | 1 |
| Seattle WA | 65,405 | Abbeville MS 38601 | 1 |
| Charlotte NC | 58,432 | Abbott TX 76621 | 1 |
| Houston TX | 57,550 | Abbottstown PA | 1 |
| Chicago IL | 51,836 | Abbyville KS 67510 | 1 |
| United States | 50,547 | Aberdeen IN | 1 |
| San Francisco CA | 44,994 | Abie NE 68001 | 1 |
| Minneapolis MN | 39,806 | Abington CT 06230 | 1 |
| Dallas TX | 39,739 | Abington NC | 1 |
There are 17,370 unique cities.
The table below shows the frequency counts of the top 10 and bottom 10 cities.
| City (top 10) | Frequency | City (bottom 10) | Frequency |
|---|---|---|---|
| New York | 164,137 | 65th Infantry | 1 |
| Houston | 108,095 | Abbyville | 1 |
| Atlanta | 103,103 | Abie | 1 |
| Chicago | 102,014 | Academy | 1 |
| Charlotte | 90,206 | Adah | 1 |
| Seattle | 89,611 | Adair Village | 1 |
| San Francisco | 78,337 | Adairville | 1 |
| Dallas | 74,010 | Adams County | 1 |
| Austin | 71,042 | Addy | 1 |
| Phoenix | 66,116 | Adelphia | 1 |
There are 52 unique states.
The table below shows the frequency counts of the top 10 and bottom 10 states.
| State (top 10) | Frequency | State (bottom 10) | Frequency |
|---|---|---|---|
| California | 1,066,787 | Wyoming | 9,978 |
| Texas | 725,249 | Vermont | 12,031 |
| Florida | 481,256 | Puerto Rico | 12,591 |
| New York | 423,525 | Alaska | 13,675 |
| Illinois | 346,562 | Montana | 16,416 |
| Pennsylvania | 326,239 | South Dakota | 16,886 |
| North Carolina | 307,983 | North Dakota | 19,467 |
| Georgia | 306,908 | Maine | 27,760 |
| Virginia | 292,574 | Idaho | 31,023 |
| Ohio | 276,539 | Rhode Island | 31,751 |
There are 1,837 unique counties.
The table below shows the frequency counts of the top 10 and bottom 10 counties.
| County (top 10) | Frequency | County (bottom 10) | Frequency |
|---|---|---|---|
| Los Angeles | 219,785 | Benson | 1 |
| King | 180,297 | Billings | 1 |
| Santa Clara | 176,393 | Broomfield | 1 |
| New York | 165,407 | Camas | 1 |
| Cook | 161,817 | Dundy | 1 |
| Orange | 149,905 | Eureka | 1 |
| Maricopa | 144,711 | Faulk | 1 |
| Dallas | 141,594 | Haakon | 1 |
| Harris | 130,443 | Harding | 1 |
| Middlesex | 127,478 | Hayes | 1 |
There are 416 unique regions.
The table below shows the frequency counts of the top 10 and bottom 10 region_states.
| Region State (top 10) | Frequency | Region State (bottom 10) | Frequency |
|---|---|---|---|
| New York-Northern New Jersey-Long Island NY-NJ-PA MSA | 469,139 | All other territories and foreign countries | 7 |
| Los Angeles-Long Beach-Santa Ana CA MSA | 301,634 | GU NONMETROPOLITAN AREA | 166 |
| Chicago-Naperville-Joliet IL-IN-WI MSA | 293,304 | MA NONMETROPOLITAN AREA | 334 |
| Dallas-Fort Worth-Arlington TX MSA | 272,626 | Danville IL MSA | 737 |
| Washington-Arlington-Alexandria DC-VA-MD-WV MSA | 270,099 | Madera CA MSA | 784 |
| San Francisco-Oakland-Fremont CA MSA | 237,849 | Palm Coast FL MSA | 802 |
| Atlanta-Sandy Springs-Marietta GA MSA | 220,896 | Pine Bluff AR MSA | 860 |
| Boston-Cambridge-Quincy MA-NH MSA | 215,914 | Bay City MI MSA | 967 |
| Philadelphia-Camden-Wilmington PA-NJ-DE-MD MSA | 186,660 | Hinesville-Fort Stewart GA MSA | 1,058 |
| San Jose-Sunnyvale-Santa Clara CA MSA | 176,857 | Lewiston ID-WA MSA | 1,076 |
There are 3,192 unique company references.
The table below shows the frequency counts of the top 10 and bottom 10 company references.
| Company Reference (top 10) | Frequency | Company Reference (bottom 10) | Frequency |
|---|---|---|---|
| CARE.COM INC | 758,072 | ADAMS RESOURCES & ENERGY INC | 1 |
| WELLS FARGO & COMPANY | 243,217 | ADIDAS | 1 |
| ARAMARK | 152,702 | ADVANCED EMISSIONS SOLUTIONS INC | 1 |
| KINDRED HEALTHCARE INC | 147,931 | AGNC INVESTMENT CORP | 1 |
| TARGET CORPORATION | 139,739 | AIR PRODUCTS AND CHEMICALS INC | 1 |
| MACY’S INC | 130,728 | AIR TRANSPORT SERVICES GROUP INC | 1 |
| MARRIOTT | 116,405 | ALLIN CORPORATION | 1 |
| EXPRESS INC | 106,123 | AMERICAN REALTY INVESTORS INC | 1 |
| THE SOUTHERN COMPANY | 105,905 | AMERICAS UNITED BANK | 1 |
| THE HOME DEPOT INC | 100,318 | AMKOR TECHNOLOGY INC | 1 |
There are 3,153 unique tickers.
The table below shows the frequency counts of the top 10 and bottom 10 tickers.
| Ticker (top 10) | Frequency | Ticker (bottom 10) | Frequency |
|---|---|---|---|
| CRCM | 758,072 | ADDDF | 1 |
| WFC | 243,217 | ADES | 1 |
| MAR | 174,368 | AE | 1 |
| ARMK | 152,702 | AGNC | 1 |
| KND | 147,931 | AGTC | 1 |
| TGT | 139,739 | ALLN | 1 |
| M | 130,728 | AMGP | 1 |
| EXPR | 106,123 | AMKR | 1 |
| SO | 105,905 | AMPG | 1 |
| HD | 100,318 | APD | 1 |
Periodicity: Daily
The graph below shows the trend in job postings from April 2016 to March 2019.
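Because the data is daily, a trend series like this is typically built by aggregating posting dates into coarser periods. The sketch below assumes each posting carries a `datetime.date`; the function name `monthly_counts` and the sample dates are illustrative only.

```python
from collections import Counter
from datetime import date

def monthly_counts(post_dates):
    """Aggregate daily posting dates into {(year, month): count}, in order."""
    counts = Counter((d.year, d.month) for d in post_dates)
    return dict(sorted(counts.items()))

# Made-up posting dates within the reported date range.
posts = [date(2016, 4, 25), date(2016, 4, 30), date(2016, 5, 2), date(2019, 3, 30)]
trend = monthly_counts(posts)
```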
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | | 25% Unique Rows |
| Missing values > 50% for one or more columns? | | ✓ | 0 columns have missing values > 50% |
| Most recent update is more than 6 months old? | ✓ | | Latest Date: 30-Mar-2019 |
| Data contains PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 3
Number of Rows: 6,539,168
Date Range: 04-Mar-2019 to 30-Mar-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
There are no missing records present in the data.
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
There are no relevant continuous variables to plot histograms
The plot below shows the top 10 and bottom 10 job roles.
There are about 1,667 different job roles in the data.
| Role (top 10) | Frequency | Role (bottom 10) | Frequency |
|---|---|---|---|
| Sales/Marketing (all) | 593,243 | Application Assistant | 3 |
| Manager | 505,026 | Apprentice Plumber | 3 |
| Engineer | 379,686 | Behavioral Health Tech | 3 |
| Sales | 335,743 | Care Team Member | 3 |
| Associate | 263,894 | Carpenter Helper | 3 |
| Team Member | 142,387 | Chief Analytics Officer | 3 |
| Driver | 140,769 | Cutting Technician | 3 |
| Assistant | 129,017 | Desktop Support Administrator | 3 |
| salesperson | 123,307 | Dietetic Technician | 3 |
| service tech/mechanic | 121,162 | Disaster Recovery Manager | 3 |
The time series below plots the number of jobs posted against the day of the month.
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | | 33.33% Unique Rows |
| Missing values > 50% for one or more columns? | | ✓ | 0 columns have missing values > 50% |
| Most recent update is more than 6 months old? | ✓ | | Latest Date: 30-Mar-2019 |
| Data contains PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 3
Number of Rows: 13,607,028
Date Range: 05-Mar-2019 to 30-Mar-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
There are no missing records present in the data.
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
There are no relevant continuous variables to plot histograms
The plot below shows the top 10 and bottom 10 job tags.
There are about 1,846 different tags in the data.
| Tag (top 10) | Frequency | Tag (bottom 10) | Frequency |
|---|---|---|---|
| sales | 459,128 | CouchDB | 2 |
| Team | 418,784 | Dispatch Systems | 2 |
| Operations | 324,657 | Mother Baby | 2 |
| Management | 295,872 | Neuropsychiatry | 2 |
| Design | 252,709 | Organ Donation | 2 |
| Engineering | 220,825 | PeopleCode | 2 |
| Hiring | 193,161 | PeopleSoft HRMS | 2 |
| Customer Service | 192,642 | Pulse Ox | 2 |
| Training | 161,920 | Solving Equations | 2 |
| Lead | 160,402 | appexchange | 3 |
The time series below plots the number of jobs posted against the day of the month.
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | | 20% Unique Rows |
| Missing values > 50% for one or more columns? | | ✓ | 0 columns have missing values > 50% |
| Most recent update is more than 6 months old? | ✓ | | Latest Date: 30-Mar-2019 |
| Data contains PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 5
Number of Rows: 148,139,575
Date Range: 25-Apr-2016 to 30-Mar-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
There are no relevant continuous columns to plot histograms
The plot below shows the 10 most common numbers of days for which a listing was open.
There are about 285 distinct values for the number of days a listing was open.
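One plausible way to derive such a measure is to subtract each listing's posting date from its removal date. The sketch below assumes that is how the duration is defined; the pair layout `(posted, removed)` and the function name `days_open_counts` are hypothetical and not taken from the dataset's schema.

```python
from collections import Counter
from datetime import date

def days_open_counts(listings):
    """Count listings by the number of days between posting and removal."""
    durations = [(removed - posted).days
                 for posted, removed in listings
                 if removed is not None]  # skip listings that are still open
    return Counter(durations)

# Made-up (posted, removed) pairs for illustration.
listings = [
    (date(2019, 3, 1), date(2019, 3, 8)),
    (date(2019, 3, 2), date(2019, 3, 9)),
    (date(2019, 3, 5), date(2019, 3, 6)),
    (date(2019, 3, 10), None),
]
counts = days_open_counts(listings)
top = counts.most_common(10)  # the 10 most common durations
```

The number of distinct durations is simply `len(counts)`.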
The graph below shows the number of posts listed from 2016 to 2019.
The graph below shows the number of posts removed from 2016 to 2019.
Parameter definitions are available in the Appendix.
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | ✓ | | 50% Unique Rows |
| Missing values > 50% for one or more columns? | | ✓ | 0 columns have missing values > 50% |
| Most recent update is more than 6 months old? | ✓ | | Latest Date: 30-Mar-2019 |
| Data contains PII (Personally Identifiable Information)? | ✓ | | |
Number of Columns: 3
Number of Rows: 1,294,582
Date Range: 04-Mar-2019 to 30-Mar-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
There are no relevant continuous variables to plot histograms
There are about 298,762 different titles in the data.
The table below shows the top 10 and bottom 10 headline titles across the job listings.
| Title (top 10) | Frequency | Title (bottom 10) | Frequency |
|---|---|---|---|
| Assistant Manager | 4,788 | ‘Back Up’ Nanny Needed For 1 Child In Brooklyn | 2 |
| Sales Associate | 4,772 | | 2 |
| Server | 4,212 | | 2 |
| Delivery Driver | 3,768 | | 2 |
| Store Manager | 2,964 | | 2 |
| Dishwasher | 2,682 | -Plant Shift Supervisor - 2nd Shift Memphis, TN | 2 |
| Cook | 2,672 | -Senior Software Development Manager | 2 |
| Assistant Store Manager | 2,644 | !! Restaurant positions open !! | 2 |
| General Manager | 2,626 | !!! CAREGIVER PART TIME DAY SHIFT 6AM-2PM Starting Wages $15.. | 2 |
| Retail Sales Associate | 2,536 | !!! FULL TIME CAREGIVER PM Shift 2-10pm !!! | 2 |
The time series below plots the number of jobs posted against the day of the month.
Univariate Analysis involves the analysis of one variable at a time.
A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.
A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
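Kernel smoothing can be illustrated with a Gaussian kernel: each sample contributes a small bell curve, and the density estimate is their average. This is a minimal sketch; the function name `gaussian_kde`, the fixed bandwidth and the sample values are assumptions, and plotting libraries usually choose the bandwidth automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

density = gaussian_kde([1.0, 2.0, 2.5, 3.0, 7.0], bandwidth=0.5)  # made-up data
grid = [i * 0.1 for i in range(91)]      # evaluation points from 0.0 to 9.0
curve = [density(x) for x in grid]       # values to plot on the Y-axis
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual samples more closely.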
A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
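The points of a normal QQ plot can be computed with the standard library. This sketch assumes a normal reference distribution fitted to the sample mean and standard deviation, with plotting positions `(i + 0.5) / n`. Both are common conventions, but not necessarily the ones used for the plots in this report.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each sorted sample value with the matching normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

sample = [2.1, 3.4, 1.9, 5.0, 2.8, 3.1, 4.2]  # made-up values
points = qq_points(sample)  # (theoretical quantile, sample quantile) pairs
```

If the sample is close to normal, the points lie near the line y = x; systematic curvature suggests skew or heavy tails.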
Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
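As an illustration, a correlation matrix can be built by applying the Pearson coefficient to every pair of numeric columns. The sketch assumes Pearson correlation and a column-name-to-list mapping; the names `pearson`, `correlation_matrix` and the sample columns are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Made-up columns: one perfectly positive and one perfectly negative pair.
data = {
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],
    "reversed_x": [4, 3, 2, 1],
}
matrix = correlation_matrix(data)
```

The helper assumes every column has nonzero variance; a constant column would cause a division by zero.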
Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.
A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.
It ensures that enough data is available to end users and applications, when and where they need it for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training, testing, and validating models.
It identifies the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.
It points out that there should be no data duplicates reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.
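A key-based uniqueness check can be sketched as below. The field names `ticker` and `posted` and the helper `duplicate_keys` are hypothetical; the report does not state which columns form the key in any of the files.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return the key values that identify more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Made-up rows; (ticker, posted) is an assumed key, not the real schema.
rows = [
    {"ticker": "WFC", "posted": "2019-03-30", "role": "Manager"},
    {"ticker": "WFC", "posted": "2019-03-30", "role": "Engineer"},
    {"ticker": "TGT", "posted": "2019-03-30", "role": "Associate"},
]
violations = duplicate_keys(rows, ["ticker", "posted"])
```

An empty result means the chosen key uniquely identifies every row in the data set.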
It is the degree to which information is current relative to the present period. It measures how up-to-date the information is, and whether it is still correct despite possible time-related changes.
It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).