Journal: The Journal of Systems and Software

SCC++: Predicting the programming language of questions and snippets of Stack Overflow

Abstract

Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted on Stack Overflow usually contain a code snippet. Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we design and evaluate Source Code Classification (SCC++), a classifier that can identify the programming language of a question posted on Stack Overflow. The classifier achieves an accuracy of 88.9% in classifying programming languages by combining features from the title, body and the code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 78.9%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 78.1%. These results show that deploying Machine Learning techniques on the combination of text and code snippets of a question provides the best performance. In addition, the classifier can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version such as C# 3.0, C# 4.0 and C# 5.0.
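
To make the idea of combining features from the title, body and code snippet concrete, here is a minimal sketch. The abstract does not say which learning algorithm or feature extraction SCC++ uses, so this example substitutes a plain bag-of-tokens multinomial Naive Bayes classifier as an assumption. The names `tokenize`, `featurize`, `NaiveBayes` and the three toy training questions are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict

# Words/identifiers, or single punctuation symbols (digits are ignored).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z0-9_]")


def tokenize(text):
    """Split text into lowercase words and single punctuation symbols."""
    return TOKEN_RE.findall(text.lower())


def featurize(title, body, snippet):
    """Merge the three parts of a question into one bag of tokens.

    Each token is prefixed with its origin (t, b or c), so the same word
    in the title and in the code counts as two different features.
    """
    feats = Counter()
    for prefix, text in (("t", title), ("b", body), ("c", snippet)):
        for tok in tokenize(text):
            feats[f"{prefix}:{tok}"] += 1
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (stand-in model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples, labels):
        for feats, label in zip(samples, labels):
            self.class_counts[label] += 1
            self.token_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)
            for tok, c in feats.items():
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += c * math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: (title, body, snippet) -> language label.
TRAIN = [
    (("Null pointer in list loop", "My java program throws an exception",
      "public static void main(String[] args) { System.out.println(x); }"), "java"),
    (("List comprehension question", "How do I filter a list in python",
      "def f(xs):\n    return [x for x in xs if x]"), "python"),
    (("Segfault with pointers", "My c program crashes in malloc",
      "#include <stdio.h>\nint main(void) { printf(\"hi\"); return 0; }"), "c"),
]

model = NaiveBayes().fit([featurize(*q) for q, _ in TRAIN], [y for _, y in TRAIN])
prediction = model.predict(
    featurize("Why does this crash", "", "#include <stdlib.h>\nint main(void) { return 0; }")
)
print(prediction)
```

Passing empty strings for some of the three arguments of `featurize` gives the text-only or snippet-only variants that the abstract compares against the combined classifier. A toy model like this says nothing about the reported accuracies.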