Analysis of Characteristics of Complaints on Parenting Q&A Sites Using pLSA and Data Augmentation
DOI:
https://doi.org/10.46604/aiti.2025.14700
Keywords:
Q&A site, complaint, NLP, data augmentation
Abstract
This study investigates the classification and clustering of complaints on a Japanese parenting Q&A site, aiming to identify meaningful patterns from limited labeled data. To address data scarcity, generative AI was utilized for data augmentation through prompts that reflected authentic parenting frustrations, with synthetic data validated by comparing classification performance under varying proportions of generated content. Complaint texts were vectorized using Bag-of-Words, Doc2Vec, and Sparse Composite Document Vectors, providing multiple levels of semantic representation. LightGBM was used as the classifier, and F1 scores measured performance. Clustering of predicted complaints employed probabilistic Latent Semantic Analysis, with topic numbers selected via Bayesian Information Criterion. Six distinct themes emerged, including childcare stress and family conflict. Incorporating generated data improved the F1 score from 0.824 to 0.865. The findings highlight the potential of generative AI to augment low-resource datasets and demonstrate the effectiveness of context-aware embeddings and probabilistic clustering in structuring real-world text data.
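The topic-number selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the symmetric pLSA parameterization P(d, w) = Σ_z P(z) P(d|z) P(w|z), counts free parameters as (K − 1) + K(D − 1) + K(W − 1), and uses the total token count as the sample size in BIC. The function names and the toy count matrix are hypothetical.

```python
import math
import random


def fit_plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit symmetric pLSA by EM; return (P(z), P(d|z), P(w|z), log-likelihood)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        # Strictly positive random weights, normalized to sum to 1.
        weights = [rng.random() + 0.1 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [random_dist(n_docs) for _ in range(n_topics)]
    p_w_z = [random_dist(n_words) for _ in range(n_topics)]

    for _ in range(n_iter):
        acc_z = [0.0] * n_topics
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_w = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w), weighted by the observed count.
                joint = [p_z[z] * p_d_z[z][d] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(joint)
                for z in range(n_topics):
                    resp = counts[d][w] * joint[z] / norm
                    acc_z[z] += resp
                    acc_d[z][d] += resp
                    acc_w[z][w] += resp
        # M-step: renormalize the expected counts into probabilities.
        total = sum(acc_z)
        for z in range(n_topics):
            denom = acc_z[z] if acc_z[z] > 0 else 1.0  # guard against an empty topic
            p_z[z] = acc_z[z] / total
            p_d_z[z] = [v / denom for v in acc_d[z]]
            p_w_z[z] = [v / denom for v in acc_w[z]]

    log_lik = 0.0
    for d in range(n_docs):
        for w in range(n_words):
            if counts[d][w] > 0:
                p_dw = sum(p_z[z] * p_d_z[z][d] * p_w_z[z][w] for z in range(n_topics))
                log_lik += counts[d][w] * math.log(p_dw)
    return p_z, p_d_z, p_w_z, log_lik


def bic(log_lik, n_topics, n_docs, n_words, n_tokens):
    """BIC = -2 ln L + k ln n, with k counted as assumed in the lead-in."""
    n_params = (n_topics - 1) + n_topics * (n_docs - 1) + n_topics * (n_words - 1)
    return -2.0 * log_lik + n_params * math.log(n_tokens)


def select_topics(counts, candidates, n_iter=100, seed=0):
    """Return the candidate topic number with the lowest BIC, plus all scores."""
    n_docs, n_words = len(counts), len(counts[0])
    n_tokens = sum(sum(row) for row in counts)
    scores = {}
    for k in candidates:
        log_lik = fit_plsa(counts, k, n_iter, seed)[3]
        scores[k] = bic(log_lik, k, n_docs, n_words, n_tokens)
    return min(scores, key=scores.get), scores


# Hypothetical document-term counts (rows: documents, columns: words).
TOY_COUNTS = [
    [5, 4, 6, 0, 0, 1],
    [4, 6, 5, 1, 0, 0],
    [6, 5, 4, 0, 1, 0],
    [0, 1, 0, 5, 6, 4],
    [1, 0, 0, 6, 4, 5],
    [0, 0, 1, 4, 5, 6],
]

best_k, bic_scores = select_topics(TOY_COUNTS, candidates=[1, 2, 3])
print(best_k, bic_scores)
```

In practice the count matrix would come from the vectorized complaint texts, and EM would typically be restarted from several seeds because it converges only to a local optimum.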
License
Copyright (c) 2025 Tomoki Yoshimi, Takashi Namatame

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Submission of a manuscript implies that the work described has not been published before and that it is not under consideration for publication elsewhere. Authors can retain copyright of their articles with no restrictions.
Since Jan. 01, 2019, AITI has published new articles under the Creative Commons Attribution Non-Commercial 4.0 International (CC BY-NC 4.0) License.
The Creative Commons Attribution Non-Commercial (CC-BY-NC) License permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
