Improvement and Evaluation of Data Consistency Metric CIL for Software Engineering Data Sets

Software data sets derived from actual software products and their development processes are widely used for project planning, management, quality assurance, and process improvement. Although certain data sets have been shown to be unfit for these purposes, data quality is often not assessed before the data are used. The principal reason is that few metrics exist to quantify the fitness of software development data. This study aims to fill that gap in the literature by devising a new and efficient method for assessing data quality. As a starting point, we take the Case Inconsistency Level (CIL), which evaluates the consistency of a data set by counting its inconsistent project pairs. Based on a follow-up evaluation with a large sample set, we show that CIL is not effective in evaluating the quality of certain data sets. By studying the problems associated with CIL and eliminating them, we propose an improved metric called Similar Case Inconsistency Level (SCIL). Our empirical evaluation with 54 data samples derived from six large project data sets shows that SCIL can distinguish between consistent and inconsistent data sets, and that prediction models for software development effort and productivity built from consistent data sets indeed achieve relatively higher accuracy.
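The abstract describes CIL only as a count of inconsistent project pairs, without giving the exact definition. The sketch below illustrates the general idea of pair counting under assumptions that are not taken from the paper: a pair is treated as inconsistent when two projects have nearly identical features but very different effort. The function name `count_inconsistent_pairs` and the thresholds `feature_tol` and `effort_ratio` are hypothetical.

```python
from itertools import combinations
from math import dist


def count_inconsistent_pairs(projects, feature_tol=0.1, effort_ratio=2.0):
    """Count project pairs with similar features but dissimilar effort.

    projects: list of (features, effort) tuples, where features is a tuple
    of numbers already scaled to [0, 1] and effort is a positive number.
    Both thresholds are assumptions chosen only for illustration; they are
    not the definition used for CIL or SCIL in the paper.
    """
    count = 0
    for (f1, e1), (f2, e2) in combinations(projects, 2):
        # Assumed similarity test: Euclidean distance between feature vectors.
        similar = dist(f1, f2) <= feature_tol
        # Assumed dissimilarity test: ratio of the larger to the smaller effort.
        ratio = max(e1, e2) / min(e1, e2)
        if similar and ratio >= effort_ratio:
            count += 1
    return count


# Made-up example data, not drawn from any of the paper's data sets.
projects = [
    ((0.20, 0.50), 100.0),
    ((0.22, 0.51), 450.0),  # close to the first project, much larger effort
    ((0.90, 0.10), 300.0),
]
inconsistent = count_inconsistent_pairs(projects)
total_pairs = len(projects) * (len(projects) - 1) // 2
print(inconsistent, total_pairs)
```

Under these assumed thresholds, only the first two projects form an inconsistent pair. How the paper normalizes the count, and how SCIL restricts it to similar cases, is not stated in the abstract and is not modeled here.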

Gan M.; Yucel Z.; Monden A.

2022-01-01

Publication year: 2022
Journal title: IEEE ACCESS
Volume: 10
DOI: https://dx.doi.org/10.1109/ACCESS.2022.3188246
Appears in types: 2.1 Journal article

Files in this record:

File	Size	Format
j_19_ieee_access_improvement.pdf (open access; type: publisher's version; license: free access, view only)	2.43 MB	Adobe PDF

Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5079722
