Number of Files Received: 6
Parameter definitions are available in Appendix
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10% ? | | ✓ | 93.38% Unique Rows |
| Missing values > 50% for one or more columns ? | | ✓ | 0 Columns have missing values > 50% |
| Most recent update is before 6 months ago ? | ✓ | | Latest Date: 29-Jun-2019 |
| Data contain PII (Personally Identifiable Information) ? | ✓ | | |
Number of Columns: 4
Number of Rows: 101,886,928
Date Range: 01-Jan-2012 to 29-Jun-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
There are no missing records present in the data.
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis and the corresponding frequency is plotted on the Y-axis.
Column active_duration represents the average number of days each posting was active in a given grouping
Below table represents the frequency count of top 10 and bottom 10 company names
There are about 37,264 different company names available in the data
| Company Name | Frequency | Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,684 | Greater Seattle Chamber of Commerce | 2,716 |
| Lyft | 10,968 | Hocking Technical College | 2,717 |
| Target | 10,953 | New Naschitti Elementary School | 2,717 |
| Cardinal Logistics | 10,949 | Santa Clara University | 2,717 |
| Mesilla Valley Transportation | 10,943 | Teletronics Technology Corporation | 2,717 |
| UPS | 10,941 | Wheaton College | 2,717 |
| AXA | 8,227 | First Atlantic Health Care | 2,718 |
| St. Joseph Health System | 8,226 | Gwinnett County Public Schools | 2,718 |
| Yavapai Regional Medical Center | 8,226 | Hancock Regional Hospital | 2,718 |
| Mosaic | 8,225 | Tahoe Truckee Unified School District | 2,718 |
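As a rough illustration of how a top 10 / bottom 10 frequency table like the one above could be produced, the sketch below counts occurrences with the standard library. The function name `top_and_bottom`, the tie-breaking rule (alphabetical by value) and the sample values are assumptions for illustration; they are not taken from the report's actual tooling.

```python
from collections import Counter

def top_and_bottom(values, n=10):
    """Return the n most frequent pairs, the n least frequent pairs,
    and the number of distinct values."""
    counts = Counter(values)
    # Most frequent first; ties are broken alphabetically (an assumption).
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    # Least frequent first; ties are broken alphabetically (an assumption).
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]
    return top, bottom, len(counts)

# Hypothetical sample values, not rows from the dataset.
sample = ["UPS", "Lyft", "UPS", "Target", "UPS", "Lyft"]
top, bottom, distinct = top_and_bottom(sample, n=2)
print(top, bottom, distinct)
```

On a column with tens of millions of rows the same counting would normally be pushed down to the database or a dataframe library; the logic stays the same.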
Periodicity: Daily
Below graph shows the frequency trend over the period January 2012 to June 2019
Parameter definitions are available in Appendix
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10% ? | | ✓ | 100% Unique Rows |
| Missing values > 50% for one or more columns ? | ✓ | | 4 Columns have missing values > 50% |
| Most recent update is before 6 months ago ? | ✓ | | No date columns in data |
| Data contain PII (Personally Identifiable Information) ? | ✓ | | |
Number of Columns: 10
Number of Rows: 61,209
Date Range: 24-Feb-2016 to 31-Aug-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
There are no missing records present in the data.
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis and the corresponding frequency is plotted on the Y-axis.
There are no relevant columns for which to plot a histogram.
Below table represents the frequency count of top 10 and bottom 10 company names
There are about 56,638 different companies available in the data
| Company Name | Frequency | Company Name | Frequency |
|---|---|---|---|
| Mesilla Valley Transportation | 10 | ’ike Group | 1 |
| CDW | 9 | !ndigo | 1 |
| Averitt Express | 7 | [24]7 Inc | 1 |
| Pizza Hut | 7 | [X+1] | 1 |
| Ruan | 7 | 01 Communique Laboratory Inc | 1 |
| UPS | 7 | 1-800-FLOWERS.COM, Inc | 1 |
| Western Express | 7 | 1-800-GOT-JUNK | 1 |
| Advantage Solutions | 6 | 1-800-Sweeper/800Sweeper, LLC | 1 |
| Celadon | 6 | 1-800 CONTACTS | 1 |
| Express Scripts | 6 | 1-800 CONTACTS, INC | 1 |
Below table represents the frequency count of top 10 and bottom 10 lei
There are about 4,278 different lei available in the data
| lei | Frequency | lei | Frequency |
|---|---|---|---|
| 4KF48RN45X1OO8UBLY20 | 29 | 01KWVG908KE7RKPTNP46 | 1 |
| 5493000GH5DTFC8LLR93 | 11 | 02CBKVAOND0BEIOF0V84 | 1 |
| 4YV9Y5M8S0BRK1RP0397 | 7 | 03D0JEWFDFUS0SEEKG89 | 1 |
| 5493008TXYN3II3PU369 | 7 | 04Y1L40RYNUCL020XS57 | 1 |
| 549300VHDC555R46LM46 | 6 | 05MQKGBWLLX7RPPDO189 | 1 |
| GTJS1N8S8I28A7L4WG97 | 6 | 06BTX5UWZD0GQ5N5Y745 | 1 |
| 225YDZ14ZO8E1TXUSU86 | 5 | 08IRJODWFYBI7QWRGS31 | 1 |
| 5493007JDSMX8Z5Z1902 | 5 | 0IDE18EMH1CUKQCUYE69 | 1 |
| 54930080C93RVZRSDV26 | 5 | 0JRPR7R1EOV4Z2M47L22 | 1 |
| 549300A4FSM5CB5M5D32 | 5 | 0M6M8M9HXLW8D3AU0N32 | 1 |
Below table represents the frequency count of top 10 and bottom 10 perm ids
There are about 23,749 different open perm ids available in the data
| perm_id | Frequency | perm_id | Frequency |
|---|---|---|---|
| 4295905360 | 29 | -5040258481 | 1 |
| 4297833085 | 17 | -5059003739 | 1 |
| 4297113611 | 15 | 102100625 | 1 |
| 4297213968 | 14 | 114840374 | 1 |
| 5038021810 | 13 | 21521133901 | 1 |
| 5040952276 | 13 | 296469794 | 1 |
| 5044197001 | 13 | 4294988003 | 1 |
| 4295349307 | 12 | 4295001381 | 1 |
| 4297385008 | 11 | 4295011269 | 1 |
| 4295904886 | 9 | 4295012519 | 1 |
Below table represents the frequency count of top 10 and bottom 10 naics codes
There are about 916 different naics codes available in the data
| NAICS Code | Frequency | NAICS Code | Frequency |
|---|---|---|---|
| 611110 | 1031 | 111211 | 1 |
| 622110 | 664 | 111219 | 1 |
| 999990 | 499 | 111320 | 1 |
| 611310 | 405 | 111421 | 1 |
| 541511 | 301 | 112111 | 1 |
| 443142 | 292 | 112340 | 1 |
| 621111 | 273 | 114111 | 1 |
| 522110 | 253 | 115310 | 1 |
| 541512 | 230 | 211112 | 1 |
| 921120 | 212 | 212230 | 1 |
Periodicity: Daily
Below graph shows the frequency trend over the period February 2016 to August 2019.
Parameter definitions are available in Appendix
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10% ? | | ✓ | 93.2% Unique Rows |
| Missing values > 50% for one or more columns ? | | ✓ | 0 Columns have missing values > 50% |
| Most recent update is before 6 months ago ? | ✓ | | Latest Date: 29-Jun-2019 |
| Data contain PII (Personally Identifiable Information) ? | ✓ | | |
Number of Columns: 4
Number of Rows: 102,340,815
Date Range: 01-Jan-2012 to 29-Jun-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis and the corresponding frequency is plotted on the Y-axis.
Column created_job_count represents how many new job postings were created for the given grouping
Note: Figure 2 is a zoomed version of Figure 1 (containing values greater than 2 and less than 20)
Below table represents the frequency count of top 10 and bottom 10 company names
There are about 36,679 different company names available in the data
| Company Name | Frequency | Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,395 | Healthcare Risk Advisors | 2,209 |
| UPS | 11,163 | Milan Laser Hair Removal | 2,209 |
| Schneider | 11,045 | Zurchers | 2,211 |
| Lyft | 10,782 | Acme Brick Tile & More | 2,215 |
| Target | 10,662 | Horizon Connects | 2,216 |
| Cardinal Logistics | 10,608 | Blue Sky Pest Control | 2,217 |
| Mesilla Valley Transportation | 10,417 | Hacker USA | 2,218 |
| Averitt Express | 9,184 | The Third Floor Inc | 2,218 |
| Community Medical Centers | 9,140 | Ministry of Social Development | 2,220 |
| St. Joseph Health System | 8,914 | Stratton Amenities | 2,221 |
Periodicity: Daily
Below graph shows the frequency trend over the period January 2012 to June 2019.
Parameter definitions are available in Appendix
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10% ? | | ✓ | 92.33% Unique Rows |
| Missing values > 50% for one or more columns ? | | ✓ | 0 Columns have missing values > 50% |
| Most recent update is before 6 months ago ? | ✓ | | Latest Date: 29-Jun-2019 |
| Data contain PII (Personally Identifiable Information) ? | ✓ | | |
Number of Columns: 4
Number of Rows: 100,560,809
Date Range: 01-Jan-2012 to 29-Jun-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis and the corresponding frequency is plotted on the Y-axis.
Column deleted_job_count represents how many job postings were removed for the given grouping
Note: Figure 2 is a zoomed version of Figure 1 (containing values greater than 2 and less than 20)
Below table represents the frequency count of top 10 and bottom 10 company names
There are about 36,038 different company names available in the data
| Company Name | Frequency | Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,704 | Atlanta Braves MLB | 2,706 |
| Target | 10,995 | Barge Waggoner Sumner & Cannon | 2,706 |
| Mesilla Valley Transportation | 10,937 | CGI Group Inc | 2,706 |
| UPS | 10,925 | Fort Miller | 2,706 |
| Lyft | 10,923 | Worley Catastrophe Response | 2,707 |
| Schneider | 10,913 | Zitter Health Insights | 2,707 |
| Cardinal Logistics | 10,880 | Alverno Laboratories | 2,708 |
| USA Truck | 8,262 | Attic Angel | 2,708 |
| Ruan | 8,259 | Cabot Corporation | 2,708 |
| XPO Logistics | 8,259 | CBRE UK | 2,708 |
Periodicity: Daily
Below graph shows the frequency trend over the period January 2012 to June 2019.
Parameter definitions are available in Appendix
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10% ? | | ✓ | 100% Unique Rows |
| Missing values > 50% for one or more columns ? | ✓ | | 2 Columns have missing values > 50% |
| Most recent update is before 6 months ago ? | ✓ | | No date columns in data |
| Data contain PII (Personally Identifiable Information) ? | ✓ | | |
Number of Columns: 9
Number of Rows: 66,402
Date Range: 26-Nov-1968 to 29-Jul-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
There are no missing records present in the data.
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis and the corresponding frequency is plotted on the Y-axis.
There are no relevant columns for which to plot a histogram.
Below table represents the frequency count of top 10 and bottom 10 company ids
There are about 0 different company IDs available in the data
| Company ID | Frequency | Company ID | Frequency |
|---|---|---|---|
| 32373 | 36 | 10007 | 1 |
| 14034 | 35 | 1015 | 1 |
| 3580 | 35 | 10291 | 1 |
| 3745 | 34 | 10297 | 1 |
| 43173 | 34 | 10323 | 1 |
| 54945 | 34 | 1047 | 1 |
| 56056 | 28 | 10584 | 1 |
| 10736 | 25 | 1077 | 1 |
| 19352 | 25 | 10797 | 1 |
| 6568 | 25 | 10823 | 1 |
Below table represents the frequency count of top 10 and bottom 10 stock tickers
There are about 22,124 different stock tickers available in the data
| Stock Ticker | Frequency | Stock Ticker | Frequency |
|---|---|---|---|
| HCA | 422 | 000046 | 1 |
| IBM | 350 | 000050 | 1 |
| WMT | 252 | 000120 | 1 |
| UNH | 244 | 000150 | 1 |
| XRX | 225 | 000166 | 1 |
| BRK | 224 | 000660 | 1 |
| BRKB | 224 | 000725 | 1 |
| ORCL | 192 | 000728 | 1 |
| UTX | 168 | 001 | 1 |
| JNJ | 160 | 002371 | 1 |
Below table represents the frequency count of top 10 and bottom 10 stock exchange countries
There are about 78 different stock exchange countries available in the data
| Stock Exchange Country | Frequency | Stock Exchange Country | Frequency |
|---|---|---|---|
| US | 19,203 | CY | 1 |
| DE | 17,495 | EG | 1 |
| GB | 8,175 | IS | 1 |
| MX | 4,382 | JM | 1 |
| CA | 3,082 | KW | 1 |
| AT | 2,801 | MU | 1 |
| CH | 2,664 | OM | 1 |
| FR | 1,123 | PK | 1 |
| JP | 993 | RS | 1 |
| CL | 844 | SK | 1 |
Below table represents the frequency count of top 10 and bottom 10 stock exchange names
There are about 113 different stock exchange names available in the data
| Stock Exchange Name | Frequency | Stock Exchange Name | Frequency |
|---|---|---|---|
| FRA | 11,969 | BEL | 1 |
| NYS | 9,355 | BRA | 1 |
| NAS | 9,056 | BURG | 1 |
| LON | 8,172 | CAI | 1 |
| MEX | 4,382 | CYS | 1 |
| BER | 2,864 | DAR | 1 |
| WBO | 2,801 | DFM | 1 |
| SWX | 2,613 | ICE | 1 |
| TSX | 1,509 | IEXG | 1 |
| TSE | 1,507 | IST | 1 |
Periodicity: Daily
Below graph shows the frequency trend over the period January 2012 to August 2019.
Parameter definitions are available in Appendix
| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10% ? | | ✓ | 92.61% Unique Rows |
| Missing values > 50% for one or more columns ? | | ✓ | 0 Columns have missing values > 50% |
| Most recent update is before 6 months ago ? | ✓ | | Latest Date: 29-Jun-2019 |
| Data contain PII (Personally Identifiable Information) ? | ✓ | | |
Number of Columns: 4
Number of Rows: 101,889,662
Date Range: 01-Jan-2012 to 29-Jun-2019
A list of columns present in the dataset, the respective datatypes and the first value present.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis and the corresponding frequency is plotted on the Y-axis.
Column unique_active_job_count represents how many unique job postings were active for the given grouping
Below table represents the frequency count of top 10 and bottom 10 company names
There are about 36,540 different company names available in the data
| Company Name | Frequency | Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,691 | Casey’s Cupcakes | 2,712 |
| Lyft | 10,954 | Castlight Health | 2,712 |
| Mesilla Valley Transportation | 10,935 | DirectMedica | 2,712 |
| Target | 10,935 | Transaction Network Services | 2,712 |
| Cardinal Logistics | 10,931 | Capita | 2,713 |
| UPS | 10,916 | City of Carlsbad | 2,713 |
| Knight Transportation | 8,223 | Jaunt | 2,713 |
| Constellation Brands | 8,222 | Nutmeg | 2,713 |
| Schneider | 8,222 | Puratos | 2,713 |
| St. Joseph Health System | 8,217 | US Storage Centers | 2,713 |
Periodicity: Daily
Below graph shows the frequency trend over the period January 2012 to June 2019.
Univariate Analysis involves the analysis of one variable at a time.
A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.
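The binning step described above can be sketched as follows. This is a minimal illustration assuming equal-width bins over a caller-supplied range; the function name `histogram` and the sample values are hypothetical, and the report's charts may use a different binning rule.

```python
def histogram(values, bins, lo, hi):
    """Count how many values fall into each of `bins` equal-width intervals on [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # A value equal to the right edge `hi` is placed in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts

# Hypothetical values: three bins covering [1, 4), [4, 7) and [7, 10].
counts = histogram([1, 2, 2, 3, 7, 9, 10], bins=3, lo=1, hi=10)
print(counts)  # [4, 0, 3]
```

Each entry of `counts` corresponds to the height of one bar.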
A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
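To make the kernel smoothing idea concrete, here is a small sketch of a Gaussian kernel density estimate. The Gaussian kernel, the fixed `bandwidth` value and the sample data are assumptions chosen for illustration; the report does not state which kernel or bandwidth its plots use.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x using a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes a small bell curve centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical sample: three values clustered near 2 and one outlier at 6.
density = gaussian_kde([1.0, 2.0, 2.5, 6.0], bandwidth=0.5)
print(density(2.0) > density(4.0))  # True: the peak sits where values are concentrated
```

A larger bandwidth gives a smoother curve but can hide genuine peaks, so the choice is a trade-off.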
A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
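The pairing of theoretical and sample quantiles can be sketched as below. The sketch assumes a normal reference distribution fitted to the sample mean and standard deviation, and plotting positions of (i + 0.5) / n; both are common conventions rather than details confirmed by the report, and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair theoretical normal quantiles with the sorted sample values."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

points = qq_points([2.1, 3.4, 1.9, 5.0, 2.8, 3.1, 2.5])
for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

If the points lie close to the line y = x, the sample is roughly normal; systematic curvature at the ends suggests skew or heavy tails.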
Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
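A minimal sketch of building such a matrix is shown below, assuming the Pearson coefficient is the one intended (the report does not say which coefficient it uses). The column names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns, not values from the dataset.
data = {
    "created": [1, 2, 3, 4, 5],
    "deleted": [2, 4, 6, 8, 10],
    "active": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(data)
print(matrix["created"]["deleted"], matrix["created"]["active"])
```

Note that a constant column has zero variance, so this sketch would raise a division error for it; a production version would need to handle that case.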
Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
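The aggregation behind a monthly trend chart can be sketched as a count of records per calendar month. The function name `monthly_counts` and the sample dates are assumptions for illustration only.

```python
from collections import Counter
from datetime import date

def monthly_counts(dates):
    """Count records per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return sorted(counts.items())

# Hypothetical record dates.
dates = [date(2012, 1, 1), date(2012, 1, 15), date(2012, 2, 3), date(2019, 6, 29)]
series = monthly_counts(dates)
print(series)  # [((2012, 1), 2), ((2012, 2), 1), ((2019, 6), 1)]
```

Grouping on `d.year` alone would give the yearly series used for longer-term trends.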
Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.
A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.
It ensures that enough data is available to end users and applications, when and where they need it, for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training and for testing or validating the models.
It is the percentage of records with non-NULL values, and can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.
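One way to compute this measure, and the "missing values > 50%" check used in the summary tables, is sketched below. Rows are assumed to be dictionaries with `None` marking a missing value; the function names, column names and sample rows are hypothetical.

```python
def completeness(rows, columns):
    """Percentage of non-missing (non-None) values per column."""
    total = len(rows)
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is not None) / total
        for col in columns
    }

def columns_mostly_missing(rows, columns, threshold=50.0):
    """Columns whose share of missing values exceeds `threshold` percent."""
    scores = completeness(rows, columns)
    return [col for col in columns if 100.0 - scores[col] > threshold]

# Hypothetical rows, not taken from the dataset.
rows = [
    {"company_name": "A", "lei": None},
    {"company_name": "B", "lei": None},
    {"company_name": None, "lei": "X"},
    {"company_name": "D", "lei": None},
]
scores = completeness(rows, ["company_name", "lei"])
flagged = columns_mostly_missing(rows, ["company_name", "lei"])
print(scores, flagged)
```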
It requires that no duplicate data be reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.
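A simple way to quantify this is the share of distinct rows, which can then be compared against the 10% duplicate threshold used in the summary tables. The sketch assumes "unique rows" means distinct rows divided by total rows, with whole rows compared; the report does not spell out its exact formula, and the sample rows are hypothetical.

```python
def unique_row_percentage(rows):
    """Percentage of rows that are distinct when whole rows are compared."""
    if not rows:
        return 100.0
    return 100.0 * len(set(rows)) / len(rows)

# Hypothetical rows as tuples; the first two are exact duplicates.
rows = [("A", 1), ("A", 1), ("B", 2), ("C", 3)]
unique_pct = unique_row_percentage(rows)
duplicates_exceed_threshold = (100.0 - unique_pct) > 10.0
print(unique_pct, duplicates_exceed_threshold)  # 75.0 True
```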
It is the degree to which the information is current as of the present period. It measures how up-to-date the information is, and whether it is still correct despite possible time-related changes.
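The six-month recency check from the summary tables can be sketched as a date comparison. Treating six months as 183 days is an assumption, as are the function name `is_stale` and the `as_of` dates, which stand in for the unknown date on which the report was run.

```python
from datetime import date, timedelta

def is_stale(latest, as_of, max_age_days=183):
    """True when the most recent record is older than `max_age_days` at `as_of`."""
    return (as_of - latest) > timedelta(days=max_age_days)

latest = date(2019, 6, 29)  # latest date reported for several of the files
# Hypothetical evaluation dates.
print(is_stale(latest, as_of=date(2019, 9, 30)))  # False
print(is_stale(latest, as_of=date(2020, 3, 1)))   # True
```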
It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).