How does correlation structure differ between real and fabricated data-sets?
BACKGROUND: Misconduct in medical research has been the subject of many papers in recent years. Among the different types of misconduct, data fabrication might be considered one of the most severe. It has been argued that correlation coefficients in fabricated data-sets are usually greater than those found in real data-sets. We aim to study the differences between real and fabricated data-sets in terms of the association between two variables.
METHOD: Three examples are presented in which outcomes from made-up (fabricated) data-sets are compared with the results from three real data-sets and with appropriate simulated data-sets. The data-sets were made up by faculty members at three universities. The first two examples are devoted to the correlation structure between continuous variables in two different settings: first, when the variables are highly correlated; second, when the variables are not correlated. In the third example, the differences between the real data-set and the fabricated data-sets are studied using the independent t-test for comparing two means.
RESULTS: In general, higher correlation coefficients are seen in the made-up data-sets than in the real data-sets. This occurs even when the participants are aware that the correlation coefficient for the corresponding real data-set is zero. The findings from the third example, a comparison of means between two groups, show that many people tend to make up data with smaller or no differences between groups, even when they know how and to what extent the groups differ.
CONCLUSION: This study indicates that high correlation coefficients can be considered a leading sign of data fabrication, as more than 40% of the participants generated variables with correlation coefficients greater than 0.70. However, the same rule may not apply when inspecting differences between group means, as we observed smaller differences between groups in the made-up data-sets than in the real data-set. We also showed that inspecting the scatter-plot of two variables can be a useful tool for uncovering fabricated data.
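As a rough illustration of the kind of screening the conclusion suggests, the sketch below computes a Pearson correlation coefficient and flags a pair of variables whose coefficient exceeds 0.70. It is not the authors' analysis: the function names `pearson_r` and `flag_high_correlation` are hypothetical, and treating the 0.70 figure from the abstract as a cutoff (applied to the absolute value) is an assumption made only for this example. A flag of this sort would at most prompt a closer look, such as a scatter-plot, and would not establish fabrication by itself.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each observation from its variable's mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# The 0.70 value comes from the abstract; using it as a screening
# cutoff is an assumption of this sketch, not a rule from the study.
HIGH_R = 0.70


def flag_high_correlation(x, y, threshold=HIGH_R):
    """Return True when |r| exceeds the threshold (assumed screening rule)."""
    return abs(pearson_r(x, y)) > threshold
```

For example, `flag_high_correlation([1, 2, 3, 4], [2, 4, 6, 8])` is true because the two variables are perfectly linearly related, whereas `flag_high_correlation([1, 2, 3, 4], [1, -1, -1, 1])` is false because their correlation is zero.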