中国人文社会科学核心期刊华东师范大学学报(教育科学版) ›› 2025, Vol. 43 ›› Issue (5): 30-43.doi: 10.16382/j.cnki.1000-5560.2025.05.003
• 教育数字化转型:学习与多智能体(特约主持人:顾小清) • 上一篇 下一篇
郑隆威1, 贺安娜2, 齐长永2, 胡碧皓2, 顾小清3, 洪道诚2
出版日期:2025-05-01
发布日期:2025-04-21
基金资助:Longwei Zheng1, Anna He2, Changyong Qi2, Bihao Hu2, Xiaoqing Gu3, Daocheng Hong2
Online:2025-05-01
Published:2025-04-21
摘要:
该研究将出声思维法与大语言模型相结合,提出“认知回响”这一新方法,补充了传统的同步出声思维法,以解决数据采集干扰思维过程的局限性。研究设计了一种能够模拟不同学生认知过程的角色智能体,并基于学习情景的再现、学习经历的重构及上传,构建了一个智能体训练框架。与传统的提示工程方法相比,该框架通过真实学习记录生成虚拟学习经历,使不同智能体能够更加准确地模拟各类学生的认知反应。研究通过对少量数据的微调训练,验证了学生智能体在认知模拟方面的潜力。结果表明,各类学生智能体能够从存量学习数据中自主获取学习经验,并基于此提供有效思维报告。这一方法可应用基于模拟的决策和培训中,这将有助于降低教育创新的成本和风险。
郑隆威, 贺安娜, 齐长永, 胡碧皓, 顾小清, 洪道诚. 认知回响:学习者智能体的出声思维研究[J]. 华东师范大学学报(教育科学版), 2025, 43(5): 30-43.
Longwei Zheng, Anna He, Changyong Qi, Bihao Hu, Xiaoqing Gu, Daocheng Hong. Cognitive Echo: The Think-aloud Protocol for Simulated Student Agents[J]. Journal of East China Normal University(Educationa, 2025, 43(5): 30-43.
表 1
评估学习者角色智能体的提示语"
| You will be given a response generated by an AI model attempting to simulate a student’s cognitive process while solving a math problem. Your task is to evaluate the response based on the {Knowledge Point Accuracy, Logical Coherence, Error Analysis, Detail Level, Consistency of Answering Style} dimension using the specific criterion provided below: Problem Description: {problem_description} AI Model’s Response: {model_response} Evaluation Criterion: |
| (1)知识准确性 |
| Knowledge Point Accuracy (1-5): Does the response correctly apply relevant knowledge points to the problem? Evaluate the correctness and completeness of the knowledge used in the response. Evaluation Steps: 1. Review the response and identify the key knowledge points that should have been applied to solve the problem. 2. Compare the response to the expected knowledge points and assess whether the AI model correctly applied them. Consider any inaccuracies or omissions in the application of knowledge. 3. Assign a score from 1 (lowest) to 5 (highest) based on the accuracy of the knowledge application. A score of 5 indicates that the model accurately and completely applied all relevant knowledge points. 4. Provide a brief justification for the score, explaining any errors, omissions, or strengths observed in the response. 5. Finally, print the score on a new line, followed by your rationale. |
| (2)逻辑连贯性 |
| Logical Coherence (1-5): Is the response logically consistent and free from contradictions? Does it follow a rational progression of ideas that leads to the solution? Evaluation Steps: 1. Review the response and identify the logical flow of ideas and steps. 2. Check for any logical inconsistencies, contradictions, or gaps in reasoning within the response. 3. Assign a score from 1 (lowest) to 5 (highest) based on the coherence of the logic. A score of 5 indicates a fully consistent and logical response with a clear progression of ideas. 4. Provide a brief justification for the score, explaining any logical errors or strengths observed in the response. 5. Finally, print the score on a new line, followed by your rationale. |
| (3)错误原因 |
| Error Analysis (1-5): If there are mistakes, does the model recognize and explain them effectively? Assess the model’s ability to diagnose and correct errors in the response. Evaluation Steps: 1. Review the response and identify any errors made by the AI model. 2. Evaluate whether the model recognizes these errors and provides a reasonable explanation or correction. 3. Assign a score from 1 (lowest) to 5 (highest) based on the model’s ability to analyze and correct errors. A score of 5 indicates that the model effectively diagnoses and corrects any mistakes. 4. Provide a brief justification for the score, discussing how well the model handled errors or missed opportunities for correction. 5. Finally, print the score on a new line, followed by your rationale. |
| (4)细节水平 |
| Detail Level (1-5): Is the response detailed enough to fully explain the problem and its solution? Does it provide sufficient context and supporting information? Evaluation Steps: 1. Review the response and assess the depth of explanation provided by the AI model. 2. Check if the response includes all necessary details, context, and supporting information to thoroughly address the problem. 3. Assign a score from 1 (lowest) to 5 (highest) based on the level of detail. A score of 5 indicates that the response is comprehensive and covers all aspects of the problem thoroughly. 4. Provide a brief justification for the score, highlighting any strengths or weaknesses in the level of detail. 5. Finally, print the score on a new line, followed by your rationale. |
| (5)回答风格一致性 |
| Consistency of Answering Style (1-5): Is the style of the response consistent across multiple turns? Does it maintain a coherent tone and approach that is appropriate for the student’s ability level? Evaluation Steps: 1. Review the response and evaluate the consistency of style, tone, and approach across the entire answer. 2. Determine if the style is appropriate for the student’s ability level and remains consistent throughout. 3. Assign a score from 1 (lowest) to 5 (highest) based on the consistency of the answering style. A score of 5 indicates a consistent and appropriate style throughout the response. 4. Provide a brief justification for the score, noting any variations in style or inconsistencies observed. 5. Finally, print the score on a new line, followed by your rationale. |
| Alshammari, T., Alhadreti, O., & Mayhew, P. (2015). When to ask participants to think aloud: A comparative study of concurrent and retrospective think-aloud methods. International Journal of Human Computer Interaction, 6 (3), 48- 64. | |
| Bates, J. (1994). The role of emotion in believable agents. Communications of the ACM, 37 (7), 122- 125. | |
| Beach, P., & Willows, D. (2017). Understanding Teachers' Cognitive Processes during Online Professional Learning: A Methodological Comparison. Online learning, 21(1), 60—84. | |
| Biri, S. K., Kumar, S., Panigrahi, M., Mondal, S., Behera, J. K., & Mondal, H. (2023). Assessing the utilization of large language models in medical education: Insights from undergraduate medical students. Cureus, 15(10). | |
| Brooks, R. A., Breazeal, C., Marjanović, M., Scassellati, B., & Williamson, M. M. (1998). The Cog project: Building a humanoid robot. In International workshop on computation for metaphors, analogy, and agents (pp. 52-87). Berlin, Heidelberg: Springer Berlin Heidelberg. | |
| Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877- 1901. | |
| Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., . . & Xie, X. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3), 1—45. | |
| Charters, E. (2003). The use of think-aloud methods in qualitative research an introduction to think-aloud methods. Brock Education Journal, 12 (2), 68- 82. | |
| Chiang, W. -L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2023, March). Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. LMSYS. https://lmsys.org/blog/2023-03-30-vicuna/. | |
| Chrysafiadi, K., & Virvou, M. (2013). Student modeling approaches: A literature review for the last decade. Expert Systems with Applications, 40 (11), 4715- 4729. | |
| Chu, S., Kim, J., & Yi, M. (2024). Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation. arXiv preprint arXiv: 2409.07355. | |
| Davis, J. N., & Bistodeau, L. (1993). How Do L1 and L2 Reading Differ? Evidence from Think Aloud Protocols. The Modern Language Journal, 77 (4), 459- 472. | |
| Ding, Q. (2023). Unraveling the landscape of large language models: a systematic review and future perspectives. Journal of Electronic Business & Digital Economics, 3 (1), 3- 19. | |
| Doi, T. (2021). Usability Textual Data Analysis: a formulaic coding think-aloud protocol method for usability evaluation. Applied Sciences, 11(15), 7047. | |
| Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., & Tang, J. (2022). GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 320–335). | |
| Eccles, D. W., & Arsal, G. (2017). The think aloud method: what is it and how do I use it?. Qualitative Research in Sport, Exercise and Health, 9(4), 514—531. | |
| Ercikan, K., Arım, R., Law, D., Domene, J., Gagnon, F., & Lacroix, S. (2010). Application of think aloud protocols for examining and confirming sources of differential item functioning identified by expert reviews. Educational Measurement Issues and Practice, 29 (2), 24- 35. | |
| Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87 (3), 215- 251. | |
| Fan, M., Lin, J., Chung, C., & Truong, K. N. (2019). Concurrent think-aloud verbalizations and usability problems. ACM Transactions on Computer-Human Interaction (TOCHI), 26 (5), 1- 35. | |
| Fonteyn, M. E., Kuipers, B., & Grobe, S. J. (1993). A description of think aloud method and protocol analysis. Qualitative health research, 3 (4), 430- 441. | |
| Fox, M. C., Ericsson, K. A., & Best, R. (2011). Do procedures for verbal reporting of thinking have to be reactive? A meta-analysis and recommendations for best reporting methods. Psychological bulletin, 137 (2), 316. | |
| Frei-Landau, R., & Levin, O. (2022). The virtual Sim (HU) lation model: Conceptualization and implementation in the context of distant learning in teacher education. Teaching and Teacher Education, 117, 103798. | |
| Heirweg, S., De Smul, M., Devos, G., & Van Keer, H. (2019). Profiling upper primary school students' self-regulated learning through self-report questionnaires and think-aloud protocol analysis. Learning and Individual Differences, 70, 155—168. | |
| Hori, M., Kihara, Y., & Kato, T. (2011). Investigation of indirect oral operation method for think aloud usability testing. In Human Centered Design: Second International Conference, HCD 2011, Held as Part of HCI International 2011, Orlando, FL, USA, July 9-14, 2011. Proceedings 2 (pp. 38-46). Springer Berlin Heidelberg. | |
| Jacobse, A. E., & Harskamp, E. G. (2012). Towards efficient measurement of metacognition in mathematical problem solving. Metacognition and Learning, 7, 133- 149. | |
| Kaminski, K. S., & Sporer, S. L. (2017). Discriminating between correct and incorrect eyewitness identifications: The use of appropriate cues. Journal of experimental psychology: applied, 23 (1), 59. | |
| Koedinger, K. R., Baker, R. S., Cunningham, K., Skogsholm, A., Leber, B., & Stamper, J. (2010). A data repository for the EDM community: The PSLC DataShop. Handbook of educational data mining, 43, 43- 56. | |
| Laird, J. E. (2001). It knows what you're going to do: Adding anticipation to a Quakebot. In Proceedings of the fifth international conference on Autonomous agents (pp. 385-392). | |
| Lara, B., Gaona, W., Escobar, E., Pardo, J. M., & Hermosillo-Valadez, J. (2021). Development of body-based spatial knowledge through mental imagery in an artificial agent. Adaptive Behavior, 29 (4), 349- 368. | |
| Lemaignan, S., Warnier, M., Sisbot, E. A., Clodic, A., & Alami, R. (2017). Artificial cognition for social human–robot interaction: An implementation. Artificial Intelligence, 247, 45- 69. | |
| Li, G., Hammoud, H., Itani, H., Khizbullin, D., & Ghanem, B. (2023). Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36, 51991- 52008. | |
| Li, Z., Shi, L., Wang, J., Cristea, A. I., & Zhou, Y. (2023). Sim-GAIL: A generative adversarial imitation learning approach of student modelling for intelligent tutoring systems. Neural Computing and Applications, 35 (34), 24369- 24388. | |
| Liu, C., Zhu, E., Zhang, Q., & Wei, X. (2018). Modeling of agent cognition in extensive games via artificial neural networks. IEEE transactions on neural networks and learning systems, 29 (10), 4857- 4868. | |
| Loxterman, J. A., Beck, I. L., & McKeown, M. G. (1994). The effects of thinking aloud during reading on students' comprehension of more or less coherent text. Reading research quarterly, 353—367. | |
| MacLellan, C. J., & Koedinger, K. R. (2022). Domain-General Tutor Authoring with Apprentice Learner Models. International Journal of Artificial Intelligence in Education, 32 (1), 76- 117. | |
| MacLellan, C. J., Harpstead, E., Wiese, E. S., Zou, M., Matsuda, N., Aleven, V., & Koedinger, K. R. (2015). Authoring Tutors with Complex Solutions: A Comparative Analysis of Example Tracing and SimStudent. In AIED workshops. | |
| Matsuda, N., Cohen, W. W., & Koedinger, K. R. (2015). Teaching the Teacher: Tutoring SimStudent Leads to More Effective Cognitive Tutor Authoring. International Journal of Artificial Intelligence in Education, 25 (1), 1- 34. | |
| Nielsen, J., Clemmensen, T., & Yssing, C. (2002). Getting access to what goes on in people's heads? Reflections on the think-aloud technique. In Proceedings of the second Nordic conference on Human-computer interaction (pp. 101-110). | |
| Overton, T., Potter, N., & Leng, C. (2013). A study of approaches to solving open-ended problems in chemistry. Chemistry Education Research and Practice, 14(4), 468—475. | |
| Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology (pp. 1-22). | |
| Perner, J., & Roessler, J. (2012). From infants’ to children's appreciation of belief. Trends in cognitive sciences, 16 (10), 519- 525. | |
| Pike, M. F., Maior, H. A., Porcheron, M., Sharples, S. C., & Wilson, M. L. (2014, April). Measuring the effect of think aloud protocols on workload using fNIRS. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 3807-3816). | |
| Premack, D., & Woodruff, G. (1978). Does the chimpanzee have a theory of mind?. Behavioral and brain sciences, 1 (4), 515- 526. | |
| Ritter, S., Anderson, J. R., Koedinger, K. R., & Corbett, A. (2007). Cognitive Tutor: Applied research in mathematics education. Psychonomic bulletin & review, 14, 249- 255. | |
| Roth, W. M., Oliveri, M. E., Sandilands, D. D., Lyons-Thomas, J., & Ercikan, K. (2013). Investigating linguistic sources of differential item functioning using expert think-aloud protocols in science achievement tests. International Journal of Science Education, 35 (4), 546- 576. | |
| Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., . . & Hashimoto, T. (2023). Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv: 2309.15817. | |
| Sallam, M. (2023). Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare, 11 (6), 887. | |
| Shao, Y., Li, L., Dai, J., & Qiu, X. (2023). Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 13153–13187). | |
| Siemens, G., Marmolejo-Ramos, F., Gabriel, F., Medeiros, K., Marrone, R., Joksimovic, S., & de Laat, M. (2022). Human and artificial cognition. Computers and Education: Artificial Intelligence, 3, 100107. | |
| Taylor, J. E. T., & Taylor, G. W. (2021). Artificial cognition: How experimental psychology can help generate explainable artificial intelligence. Psychonomic Bulletin & Review, 28 (2), 454- 475. | |
| Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J. B., Yu, J., . . & Ahn, J. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv: 2312.11805. | |
| Tu, Q., Fan, S., Tian, Z., & Yan, R. (2024). Charactereval: A chinese benchmark for role-playing conversational agent evaluation. arXiv preprint arXiv: 2401.01275. | |
| Van Pinxteren, M. M., Pluymaekers, M., & Lemmink, J. G. (2020). Human-like communication in conversational agents: a literature review and research agenda. Journal of Service Management, 31 (2), 203- 225. | |
| VanLehn, K., Ohlsson, S., & Nason, R. (1994). Applications of simulated students: An exploration. Journal of Artificial Intelligence in Education, 5, 135- 135. | |
| Vygotsky, L. S. (2012). Thought and language. MIT press. | |
| Wallach, W., Franklin, S., & Allen, C. (2010). A conceptual and computational model of moral decision making in human and artificial agents. Topics in cognitive science, 2 (3), 454- 485. | |
| Wang, X., Xiao, Y., Huang, J. T., Yuan, S., Xu, R., Guo, H., . . & Xiao, Y. (2024). Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1840-1873). | |
| Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., . . & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824—24837. | |
| Wimmer, H., & Perner, J. (1983). Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13 (1), 103- 128. | |
| Xu, S., & Zhang, X. (2023). Leveraging generative artificial intelligence to simulate student learning behavior. arXiv preprint arXiv: 2310.19206. | |
| Xu, S., Zhang, X., & Qin, L. (2024). EduAgent: Generative Student Agents in Learning. arXiv preprint arXiv: 2404.07963. | |
| Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., . . & Fan, Z. (2024). Qwen2 technical report. arXiv preprint arXiv: 2407.10671. | |
| Zhang, J., Borchers, C., Aleven, V., & Baker, R. S. (2024). Using large language models to detect self-regulated learning in think-aloud protocols. In Proceedings of the 17th International Conference on Educational Data Mining. |
| [1] | 黄昌勤, 钟益华, 王希哲, 韩中美, 魏同权. 从单智能体到多智能体:大模型智能体支持下的激励型学习活动设计与实证研究[J]. 华东师范大学学报(教育科学版), 2025, 43(5): 44-56. |
| [2] | 顾小清, 郝祥军. 悟空的毫毛:正在重塑学习技术系统的多智能体[J]. 华东师范大学学报(教育科学版), 2025, 43(5): 16-29. |
| 阅读次数 | ||||||
|
全文 |
|
|||||
|
摘要 |
|
|||||