Clustering Analysis with Embedding Vectors: An Application to Real Estate Market Delineation
DOI:
https://doi.org/10.46604/aiti.2021.8492
Keywords:
clustering, categorical data, high-cardinality, entity embedding, market delineation
Abstract
Although clustering analysis is a popular tool in unsupervised learning, it is inefficient for datasets dominated by categorical variables, e.g., real estate datasets. To apply clustering analysis to such datasets, this study proposes an entity embedding approach that transforms categorical variables into vector representations. Three variants of a clustering algorithm, based respectively on the traditional Euclidean distance, the Gower distance, and the embedding vectors, are applied to land sales records to delineate the real estate market in Gwacheon-si, Gyeonggi province, South Korea. The relevance of the resulting submarkets is then evaluated using the root mean squared error (RMSE) obtained from a hedonic pricing model. The results show that the RMSE decreases substantially, from 0.076-0.077 to 0.069, with the embedding vector-based algorithm. The clustering algorithm empowered by embedding vectors thus outperforms the conventional algorithms, enhancing the relevance of the delineated submarkets.
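For readers unfamiliar with the Gower distance named in the abstract, the following is a minimal sketch of the standard unweighted form: numeric features contribute a range-normalised absolute difference, categorical features contribute a simple mismatch (0 or 1), and the result is the mean over all features. The function name `gower_distance` and the three sample land records are hypothetical and are not taken from the study, whose own implementation is not described here.

```python
import numpy as np

def gower_distance(num, cat):
    """Pairwise unweighted Gower distance for mixed data (illustrative sketch).

    num: (n, p) array of numeric features
    cat: (n, q) array of categorical labels
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    # Range-normalised absolute difference for each numeric feature
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    num_part = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Simple mismatch for each categorical feature: 0 if equal, 1 otherwise
    cat_part = (cat[:, None, :] != cat[None, :, :]).astype(float)
    # Mean over all numeric and categorical features
    total = num_part.sum(axis=2) + cat_part.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])

# Hypothetical records: one numeric feature (area) and one categorical (land use)
area = [[100.0], [200.0], [300.0]]
land_use = [["residential"], ["residential"], ["commercial"]]
D = gower_distance(area, land_use)
print(D)
```

The resulting matrix can be passed to any clustering routine that accepts precomputed dissimilarities. With high-cardinality categorical variables, most mismatch terms equal 1, so many pairs of records end up nearly equidistant.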
License
Submission of a manuscript implies that the work described has not been published before and that it is not under consideration for publication elsewhere. Authors can retain copyright of their article with no restrictions.
Since Jan. 01, 2019, AITI has published new articles under the Creative Commons Attribution Non-Commercial 4.0 International (CC BY-NC 4.0) License.
The Creative Commons Attribution Non-Commercial (CC-BY-NC) License permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.