Effective Sampling Selection Strategy with Reduced Effort Implied, In Tuning Large Scale Deduplication


Tuning a deduplication process typically relies on a set of manually labeled pairs. In very large datasets, however, producing these labeled pairs by hand is tedious. This article therefore proposes T3S, a two-stage sampling selection strategy that reduces the set of pairs needed to tune the deduplication process. In the first stage, a balanced subset of the data is produced for labeling. In the second stage, redundant and duplicated data are removed, and only the deduplicated data are produced as output.

Published In : IJCSN Journal Volume 5, Issue 3

Date of Publication : June 2016

Pages : 523-525

Figures : 05

Tables : --

Publication Link : Effective Sampling Selection Strategy with Reduced Effort Implied, In Tuning Large Scale Deduplication

Ashwini R. : PG Scholar, Department of Computer Science and Engineering, Vels University, Chennai, India.

Sridevi S. : Assistant Professor, Department of Computer Science and Engineering, Vels University, Chennai, India.

Deduplication, FS-Dedup, T3S

The proposed T3S, a two-stage sampling strategy, aims to reduce the user labeling effort in large-scale deduplication tasks. In the first stage, T3S selects small random subsamples of candidate pairs from different fractions of the dataset. In the second stage, the subsamples are analyzed incrementally to remove duplicated data. T3S was evaluated on synthetic and real datasets and, in comparison with four modules, was shown to considerably reduce user effort in large-scale deduplication tasks.
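The first stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each candidate pair already carries a similarity score in [0, 1], and that the "fractions" of the dataset are equal-width similarity levels. The function name `sample_balanced_pairs` and the parameters `num_levels` and `per_level` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def sample_balanced_pairs(scored_pairs, num_levels=10, per_level=2, seed=0):
    """Draw a small random subsample of candidate pairs from each level.

    scored_pairs: iterable of (pair_id, similarity), similarity in [0, 1].
    Returns a dict mapping level index -> list of sampled pair ids.
    Assumption: levels are equal-width similarity ranges (illustrative only).
    """
    rng = random.Random(seed)

    # Group candidate pairs by similarity level.
    levels = defaultdict(list)
    for pair_id, similarity in scored_pairs:
        # A similarity of exactly 1.0 is clamped into the top level.
        level = min(int(similarity * num_levels), num_levels - 1)
        levels[level].append(pair_id)

    # Sample at most per_level pairs from every non-empty level, so that
    # sparse and dense levels contribute comparably to the labeling set.
    sample = {}
    for level in sorted(levels):
        members = levels[level]
        sample[level] = rng.sample(members, min(per_level, len(members)))
    return sample
```

Under these assumptions, the pairs returned for each level would be the ones handed to the user for labeling. The second stage, which analyzes the subsamples incrementally, is not covered by this sketch.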
