The deduplication process is typically tuned with a set
of manually labeled pairs. In very large datasets, however,
producing manually labeled pairs is tedious. This article
therefore proposes T3S, a two-stage sampling selection
procedure that reduces the set of pairs needed to tune the
deduplication process. In the first stage, a balanced subset
of the data is produced for labeling. In the second stage,
the redundant and duplicated data are removed, and only the
deduplicated data are produced as the output.
Ashwini R.: PG Scholar, Department of Computer Science and Engineering,
Vels University, Chennai, India.
Sridevi S.: Assistant Professor, Department of Computer Science and Engineering,
Vels University, Chennai, India.
Deduplication, FS-Dedup, T3S
The proposed T3S, a two-stage sampling strategy, aims to
reduce the user labeling effort in large-scale
deduplication tasks. In the first stage, T3S selects small
random subsamples of candidate pairs from different
fractions of the dataset. In the second stage, the subsamples
are analyzed incrementally to remove duplicated data. T3S
was evaluated on synthetic and real datasets, and the results
show that, in comparison with four modules, T3S
considerably reduces user effort in large-scale
deduplication tasks.
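
The two stages summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: candidate pairs are taken to carry a precomputed similarity score in [0, 1], the "fractions" of the dataset are modeled as fixed-width similarity levels, and the removal of duplicates is modeled with a union-find structure, which the text does not specify. The names balanced_sample, deduplicate, n_levels and per_level are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(pairs, n_levels=10, per_level=2, seed=0):
    """Stage 1 (sketch): draw up to `per_level` random pairs from each similarity level.

    `pairs` is a list of (id_a, id_b, similarity) tuples, similarity in [0, 1].
    The level width and sample size are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    levels = defaultdict(list)
    for pair in pairs:
        # A similarity of exactly 1.0 is folded into the top level.
        level = min(int(pair[2] * n_levels), n_levels - 1)
        levels[level].append(pair)

    sample = []
    for level in sorted(levels):
        bucket = levels[level]
        sample.extend(rng.sample(bucket, min(per_level, len(bucket))))
    return sample


def deduplicate(record_ids, batches_of_matches):
    """Stage 2 (sketch): merge matching pairs batch by batch, keep one record per group.

    `batches_of_matches` is an iterable of lists of (id_a, id_b) pairs that were
    judged to be duplicates. Union-find is an assumed stand-in for the paper's
    actual procedure.
    """
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent links to the group representative, shortening the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for batch in batches_of_matches:
        for a, b in batch:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # The smaller id stays as the representative of the merged group.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    # One representative per group is the deduplicated output.
    return sorted({find(r) for r in record_ids})
```

In this sketch, only the pairs returned by balanced_sample would be shown to the user for labeling, and the pairs labeled as matches would then be passed to deduplicate in batches. How the labels are obtained, and how a classifier is tuned from them, is deliberately left out.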