Abstract:
Software developers often need to use a large number of base packages written by other developers. To learn how to use these packages beyond what their documentation provides, developers enter code keywords into a code search engine to find relevant code snippets. This paper proposes a code search method based on vector representations. The method collects code fragments from GitHub and Stack Overflow datasets, trains a skip-gram model for code word expansion, and uses this model to expand the code words extracted from the search text with associated words. The search result is generated by obtaining the vector group of the code segments in the context of the search keywords, encoding this vector group together with the vector group of the code segments to be matched, and ranking the candidates by cosine similarity. The proposed algorithm was evaluated on the GitHub and Stack Overflow datasets. On the Stack Overflow dataset, 58% of searches find the correct answer in the first search result, 65% find it within the first five results, and 72% find it within the first ten results, with some improvement in recall and F value. On the GitHub dataset, 59% of searches find the correct answer in the first search result, 67% find it within the first five results, and 74% find it within the first ten results, again with some improvement in recall and F value. The experimental results show that the proposed algorithm outperforms typical methods for code retrieval over large amounts of data.
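The ranking step described above can be illustrated with a minimal sketch. It assumes that each code segment is encoded by averaging its word vectors; the abstract does not specify the encoding, so mean pooling is only a stand-in. The function names, snippet names, and toy 3-dimensional vectors are hypothetical and would in practice come from the trained skip-gram model.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors into one vector.

    Mean pooling is an assumption made for this sketch, not the paper's
    stated encoding.
    """
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_snippets(query_word_vectors, snippet_word_vectors):
    """Return (snippet name, score) pairs sorted by descending similarity."""
    query_vec = mean_vector(query_word_vectors)
    scored = [
        (name, cosine_similarity(query_vec, mean_vector(vecs)))
        for name, vecs in snippet_word_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for skip-gram embeddings.
query = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
snippets = {
    "read_file": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "sort_list": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "open_socket": [[0.0, 0.0, 1.0]],
}

ranking = rank_snippets(query, snippets)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the snippet whose words point in nearly the same direction as the query words is ranked first, and the snippet orthogonal to the query is ranked last.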