A systematic mapping study of source code representation for deep learning in software engineering

The usage of deep learning (DL) approaches for software engineering has attracted much attention, particularly in source code modelling and analysis. However, in order to use DL, source code needs to be formatted to fit the expected input form of DL models. This problem is known as source code representation. Source code can be represented via different approaches, most importantly the tree-based, token-based, and graph-based approaches. We use a systematic mapping study to investigate in detail the representation approaches adopted in 103 studies that use DL in the context of software engineering. The studies are collected from 2014 to 2021 from 14 different journals and 27 conferences. We show that each way of representing source code can provide a different, yet orthogonal view of the same source code. Thus, different software engineering tasks might require different (combinations of) code representation approaches, depending on the nature and complexity of the task. Particularly, we show that it is crucial to define whether the DL approach requires lexical, syntactical, or semantic code information. Our analysis shows that a wide range of different representations and combinations of representations (hybrid representations) are used to solve a wide range of common software engineering problems. However, we also observe that current research does not generally attempt to transfer existing representations or models to other studies, even though there are other contexts in which these representations and models may also be useful. We believe that there is potential for more reuse and the application of transfer learning when applying DL to software engineering tasks.


| INTRODUCTION
Machine learning (ML), and nowadays deep learning (DL), is increasingly used by software engineering (SE) researchers and practitioners for a wide range of tasks. Examples include source code classification [1][2][3], code clone detection [4][5][6], bug detection [7][8][9], and code summarisation [10][11][12]. The current interest in DL is enabled by the wide availability of large-scale data (e.g., through open-source systems hosted on platforms such as GitHub). Particularly, DL is interesting to researchers as it promises good results (e.g., highly accurate code clone detection) without the cumbersome (and often limiting) explicit feature extraction from raw data that is required by traditional machine learning models [13].
In classical machine learning approaches, a considerable amount of effort goes into designing proper ways to capture the structure of the data, that is, feature engineering, which is a manual, human effort in most cases. This is the reason why, in the last decade, attention in machine learning has been moving to 'representation learning', which consists of automatically extracting or learning features without the need for human feature engineering. In representation learning, the feature engineering and selection phases are replaced with deep neural networks. DL models are composed of multiple layers that learn data representations at multiple, increasingly higher levels of abstraction [14]. The networks are supposed to learn the data representation automatically, simulating the human brain in learning and analysis. In particular, neural networks can be used to learn a representation of input data such as programme source code.
However, using a DL model does not entirely free researchers from all preparatory work. In order to use these techniques, appropriate features first need to be extracted from the programme source code and represented in a way the DL model can understand. This process is known as 'code representation'. Code representation is the process of transforming the textual programme source code into a generic input format acceptable to the DL model [15]. Researchers can make use of different representation approaches, depending on the kind of information that needs to be extracted. Examples include token-based representation for lexical information, tree-based representation for syntactical information, and graph-based representation for semantic information.
No single DL and code representation approach is a silver bullet that works ideally in every case. Furthermore, in practice, choosing a suitable code representation approach is not trivial, as the choice is heavily impacted not only by which DL models should be employed, but also by the requirements of the software engineering task that should be addressed. Some problems might require focussing on the semantics of the code rather than the syntax. For example, a research contribution in code summarisation will require different types of information to be extracted than a clone detection approach. Currently, there is no study that has investigated which representation approaches are predominantly used for which types of problems, nor is there collective evidence regarding which approaches work better for which use cases.
In this paper, we address this gap through a systematic mapping study. We systematically collected a dataset of 103 studies published between 2014 and 2021 in 20 different conferences and journals. Our primary goal was to investigate academic studies that propose or evaluate the usage of DL and code representation to address practical software engineering tasks, such as source code classification [1] or code clone detection [5]. Our main acceptance criterion was that studies needed to (a) employ DL to address a practical software engineering task (excluding studies that use DL as a tool to conduct software engineering research, such as identifying automated code contributions [16]), and (b) explicitly discuss their code representation approach. The goal of this study is to provide an exhaustive analysis and overview on the progress achieved in using DL models in different software engineering tasks. We further discuss current best practices and elaborate on gaps in the current state of research.
We show that each way of representing code can provide a different, yet orthogonal view of the same source code. Thus, different SE tasks might require different (combinations of) code representation approaches, depending on the nature and complexity of the task. Particularly, we show that it is crucial to define whether the DL approaches require lexical, syntactical, or semantic code information. Our analysis shows that a wide range of different representations are used to tackle a wide range of common SE problems. We find that all three major types of code representation (token-, tree-, and graph-based) are employed, but tree-based approaches (typically based on Abstract Syntax Trees, ASTs) are currently the most used. Graph-based representations are not yet common, but a growing area of research. Hybrid representations, which combine different representation approaches in a single approach, are also seeing increasing use.
Nevertheless, our results also show a lack of generalisability of the presented approaches to other tasks as well as a lack of validation based on industrial datasets. Most studies construct models for a single limited-scope task based on open-source data and rarely validate the constructed model outside of the open-source domain. Evidently, industrial datasets are not inherently superior to open-source ones. However, during our review, it became clear that virtually all analysed studies are based on open-source data, published datasets (which are often also constructed based on open-source data), or in some cases, artificial data. We argue that this limits the generalisability of the investigated studies to closed-source industrial applications and denotes a gap in the current research.
The rest of this paper is structured as follows. We present necessary background about code representation in Section 2. In Section 3, we detail the applied mapping study methodology and research questions and also provide an overview of the 103 papers that form the basis of our discussion. Afterwards, in Sections 4-8, we elaborate on the findings of the mapping study per research question. This is followed by presenting the research gaps and challenges in Section 9 and potential future directions in Section 10. Finally, we conclude the paper in Section 11.

| PRELIMINARIES
To contextualise the rest of this study, we now present some background about code representation. In particular, we introduce three possible forms in which source code can be represented in DL. In the literature, code representation approaches are classified into four categories: token-based, tree-based, graph-based, and others [17]. Every form maps different syntactical and semantic aspects of the source code to a specific data structure. These representations can then be embedded so that a neural network can use source code as input.
Source code is originally a text encoding representing a programme. This text can be processed and further transformed into different representation forms. In this section, we describe three well-known representations, each one mapping certain aspects of the original source code. We use the C snippet depicted in Listing 1 as a running example.

Listing 1 Running example (C snippet)

```c
void foo() {
    int x = source();
    /* ... */
}
```

| Token-based representation
This representation treats code as free text. Thus, it converts the code into a list of tokens where each word (e.g., "void") is a token, but each special character (e.g., '(') is also a token (rather than considering it as part of a word). An example is given in Listing 2.
Listing 2 Token representation for the code in Listing 1

Each token will then be encoded into a vector of numbers using different statistical language models, such as word embeddings [18] or n-grams [19]. In principle, a word embedding is a learned representation for text in which words with the same meaning receive a similar representation. Technically, word embeddings are a class of techniques where individual words are represented as real-valued vectors in a predefined vector space [20]. Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network; hence, the technique is often lumped into the field of DL. N-grams, in turn, are contiguous sequences of n terms and are useful abstractions for modelling sequential data, such as text, where there are dependencies among the terms in a sequence. A corpus of code can be regarded as a sequence of sequences, and corpus-based models, such as n-grams, learn conditional probability distributions from the order of terms in a corpus. Corpus-based models can be used for many different types of tasks, such as discriminating instances of data or generating new data that are characteristic of a domain. Embeddings can thus be considered a way to represent words and help the DL model learn a representation of the source code; an embedding can be trained to represent n-grams or just individual words.
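To make this concrete, the following minimal Python sketch tokenises the beginning of Listing 1 and extracts bigrams. The regular expression and function names are illustrative choices, not taken from any of the surveyed studies.

```python
import re

# One token per word, one token per special character (a simplification).
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code: str) -> list[str]:
    """Split source code into a flat token list."""
    return TOKEN_PATTERN.findall(code)

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-token subsequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

code = "void foo() { int x = source();"
tokens = tokenize(code)
print(tokens)
# ['void', 'foo', '(', ')', '{', 'int', 'x', '=', 'source', '(', ')', ';']
print(ngrams(tokens, 2)[:3])
# [('void', 'foo'), ('foo', '('), ('(', ')')]
```

Each token (or n-gram) would subsequently be mapped to a vector via a learned embedding, as described above.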

| Tree-based representation
This representation captures the abstract syntactic structure of the source code. ASTs are a tree representation that is widely used by programming language and SE tools. Figure 1 shows an example of an AST representation. The nodes of the AST correspond to constructs or symbols of the source code. In comparison to the token-based approach, the AST representation is abstract and does not include all available details, such as punctuation and delimiters. ASTs can be used to capture the lexical information and the syntactic structure of source code, such as the function name and the flow of the instructions (e.g., in an if or while construct). Recently, some approaches have combined neural networks and ASTs to constitute tree-based neural networks (TNNs) [21]. Given a tree, TNNs learn the vector representation by recursively computing node embeddings in a bottom-up way. Popular TNN models are the Recursive Neural Network (RvNN) [22], the Tree-based Convolutional Neural Network (TBCNN) [3], and the Tree-based Long Short-Term Memory (Tree-LSTM) [23].
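As an illustration, the following sketch parses a Python analogue of the running example (Python is used here only because its standard library ships an AST parser; the surveyed studies target various languages) and extracts the node-type sequence via a pre-order traversal.

```python
import ast

# Python analogue of Listing 1 (the running example itself is C).
source = "def foo():\n    x = source()\n"
tree = ast.parse(source)

def node_types(node: ast.AST) -> list[str]:
    """Pre-order traversal collecting the type of every AST node."""
    types = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        types.extend(node_types(child))
    return types

print(node_types(tree))
# e.g. ['Module', 'FunctionDef', 'arguments', 'Assign', 'Name', 'Store',
#       'Call', 'Name', 'Load']
```

A TNN would instead consume the tree structure directly, computing an embedding for each node from the embeddings of its children.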

| Graph-based representation
In this approach, source code is represented as a graph, which can be done at many different levels; the chosen level determines the type of the representation graph. For instance, a control flow graph (CFG, see Figure 2a) describes the sequence in which the instructions of a programme will be executed. The graph is thus determined by conditional constructs, for example, if, for, and switch statements. In CFGs, nodes denote statements and conditions, and they are connected by directed edges to indicate the transfer of control.
Alternatively, the representation might follow the data flow and thus be variable-oriented. A data flow graph (DFG) is used to follow and track the usage of variables through the CFG; a DFG edge represents a subsequent access to, or modification of, the same variable. A call flow graph (CallFG) captures the relation between a statement which calls a function and the called function [24]. Finally, the entire programme can be represented as a graph using a programme dependency graph (PDG, see Figure 2b), where statements and predicate expressions are characterised by the nodes. In this study, we differentiate between the tree- and graph-based approaches since each representation approach is used to retrieve a different level of information from the source code. The tree-based approach, such as using the AST, is used to extract the syntactical information from the source code [21], whereas graph-based approaches, such as the CFG or DFG, extract semantic information [25].

FIGURE 1: Abstract syntax tree for the code snippet in Listing 1
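The following sketch hand-builds a CFG and overlays DFG edges for a hypothetical fragment, assuming networkx as the graph library; studies in practice derive such graphs with compiler tooling instead.

```python
import networkx as nx

# Hypothetical fragment: x = source(); if (x > 0) { use(x); } return;
cfg = nx.DiGraph()
cfg.add_edge("x = source()", "if (x > 0)")
cfg.add_edge("if (x > 0)", "use(x)", branch="true")
cfg.add_edge("if (x > 0)", "return", branch="false")
cfg.add_edge("use(x)", "return")

# DFG edges connect a variable's definition to its subsequent uses.
dfg = nx.DiGraph()
dfg.add_edge("x = source()", "if (x > 0)", var="x")
dfg.add_edge("x = source()", "use(x)", var="x")
```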

| RESEARCH METHODOLOGY
Our goal is to study what code representation approaches are used in combination with DL within the field of software engineering, and which code representation approaches are suitable for which tasks. Our primary method is a systematic mapping study. Systematic mapping studies are a commonly used research method to systematically analyse a mature body of research and to derive recommendations from a disparate, large body of published works.

| Research questions
To effectively conduct a systematic mapping study, it is crucial to have well-defined research questions. Our research questions address the main attributes of the study (code representation, DL models, and tackled software engineering tasks) on multiple levels. In the following, we present the headlines of our research questions along with the corresponding detailed questions.

RQ 1 Main Attributes Analysis
In RQ1, we are primarily interested in which software engineering tasks, DL models, and code representation approaches are currently prominently investigated in the field of study. This is crucial to contextualise and further analyse our subsequent findings.

RQ 1.1 SE Tasks: Which software engineering problems are commonly tackled with DL using code representation?

RQ 1.2 DL Models: Which DL models are being used in conjunction with code representation in software engineering research?
While other reviews study DL and SE tasks in more detail, our goal is to investigate which DL models are specifically used, with a strong emphasis on code representation.

RQ 1.3 Code Representation Approaches: Which code representation approaches are being used?
Finally, it is evidently important to our study goal to identify the basic code representation approaches that literature currently has to offer to software engineering researchers.

RQ 2 Detailed Analysis Based on SE Tasks
Within RQ2, we conduct a deeper analysis of our dataset to identify which code representation approaches, on one side, and DL, on the other side, are commonly used to tackle which kinds of problems. Particularly, we are interested in identifying characteristics and commonalities of tasks that make them particularly amenable to a specific type of representation or model.

RQ 2.1 Tasks and Models:
Which DL models are being used to tackle which software engineering tasks?
Firstly, we correlate software engineering tasks with used DL models with the goal of identifying which models are particularly suitable to solve which tasks.

RQ 2.2 Tasks and Representations:
Which code representation approaches are being used to tackle which software engineering tasks?
Secondly, we further correlate software engineering tasks with used code representation approaches.

RQ 3 Main Attributes Cross-Analysis: Which code representation and DL models are commonly used to solve a specific software engineering task?
In RQ3, we perform the analysis on all three main attributes (task, representation, and model) together to map the code representation and DL models with different software tasks.

RQ 4 Hybrid Approach Analysis
In RQ4, we analyse the studies that combine different approaches in one framework. In the rest of this paper, we will be referring to the overall solution presented in the retrieved studies as the 'framework'. These approaches provide valuable characteristics since they have either a wider scope, solving multiple tasks simultaneously, or more powerful capabilities, fulfilling many requirements by integrating multiple representation approaches. However, this integration of multiple approaches increases the cost of implementing these (fairly complex) frameworks.

RQ 4.1 Hybrid Software Tasks:
What are the characteristics of frameworks that handle multiple software tasks? How are the different software tasks processed?
We first scrutinise the studies that are set to solve different software tasks at once. The overarching aim is to elicit insights into the strategies followed to tackle multiple different tasks.

RQ 4.2 Hybrid Representations: How are multiple representation approaches combined within one framework?

In this research question, we study research that exploits multiple representation approaches at the same time. We also examine how this integration is carried out to expand the efficiency of the framework. Finally, our study raises the question of which promising areas are currently underexplored and warrant future research in the software engineering field.

| Literature search and selection
To conduct our study, we followed the process outlined in Figure 3. We used a two-step method for the literature search. Firstly, we collected an initial set of candidate papers through a database search. Secondly, we used iterative backward and forward snowballing to extend this initial candidate set (the seed).
For constructing the initial candidate set, we relied on a single primary search database (Google Scholar) rather than aggregating results from different digital libraries, such as the ACM Digital Library or IEEE Xplore. The reason for this was two-fold: (1) Google Scholar has a highly complete index, and it is unlikely that searching in other libraries would lead to additional search results, and (2) since we heavily made use of snowballing, completeness of the initial candidate set was deemed less crucial (as important missing work would appear during the snowballing process).
The initial candidate list was generated by executing the following search term on Google Scholar:

code representation for deep learning
We screened the first five pages of search results based on paper title and abstract. These potentially relevant papers were then evaluated with regard to our inclusion criteria (see Section 3.3). If a paper matched the criteria, it was added to the study dataset. After five pages (and initial snowballing), we observed saturation, that is, investigating the next two pages of search results did not lead to further papers matching the inclusion criteria. Hence, we stopped the search at this point.
We used explicit backward and forward snowballing to extend our initial set of candidate papers: for each selected paper, we further screened the reference list for additional relevant papers and also used Google Scholar's "cited by" functionality to discover later papers that have referenced papers in our initial set. We applied the same basic strategy to these additional candidate papers (screening based on title and abstract, followed by an explicit evaluation of inclusion criteria). This process has been repeated iteratively until no new papers could be found.

| Inclusion criteria
To clearly delineate papers that are within the scope of our study, we defined the following inclusion criteria:

• I1: Published in 2014 or later. We chose 2014 as a cut-off point because this was the year the TensorFlow system was initially released.
• I2: Making use of DL as a core contribution of the paper and explicitly reporting on the used code representation approach. To illustrate this criterion, we discuss the following study as a counterexample [26]. In this study, Laaber et al. tackled an SE task (predictability of system performance) and used an artificial neural network (ANN) as a DL model for that task. However, the authors do not report on a specific code representation approach as they relied on static features of the source code (e.g., lines of code or the number of loops).

I1-I3 define the topical relevance to our study goals. I4 was important to ensure that all the data required for our study are actually reported by the papers in our dataset. We did not focus on publications in a specific venue and also accepted unpublished academic preprints if no published version of the paper exists. To be selected into the dataset, a paper had to fulfil all four inclusion criteria.

| Resulting study dataset
Applying this literature search and selection procedure resulted in a dataset of 103 relevant studies, which are listed in Appendix A. Figure 4 shows the distribution of the papers in our dataset over time between 2014 and 2021. It can be observed that the number of relevant studies has increased over the years, from only two relevant publications in 2014 to 30 publications in 2019, followed by a slight decline in 2020 (the last complete year in our study) with 19 publications. It is also interesting to note the steadily increasing fraction of publications in academic journals rather than conferences or workshops.
In Figure 5, we summarise the conference venues that are common targets in this field of study. Conference venues with only one publication are not depicted in the figure. Unsurprisingly, ICLR, which is dedicated to presenting advances in representation learning, is the biggest contributor to our dataset with nine studies. It is followed by ICSE, which is widely seen as one of the highest-ranked software engineering conferences, with eight studies, and MSR with seven studies. A smaller subset of our dataset has been published in ML venues, such as AAAI, NeurIPS, or ICML, or in programming language venues, such as PLDI. The abbreviations of the venues presented in Figure 5 are listed in Appendix B.

| Data extraction, coding, and analysis
To analyse this dataset and answer the research questions of this study, a coding taxonomy was developed. The taxonomy is presented in Table 1. We consider the three categories (code representation approach, DL model, and software task) as primary attributes, whereas code-level granularity and programming language are in part related to code representation. Following our research questions, we iteratively coded (1) the programming language, (2) the code-level granularity, (3) the used code representation approach, (4) the used DL model, and (5) the software task. Each publication in the dataset was coded by the first and second authors according to the taxonomy (in addition to collecting basic bibliographical information, such as the publication date and venue), with the other authors serving as a sounding board and helping to resolve possible ambiguities. The resulting data were then analysed and plotted using Python scripts. We make the final coding sheet as well as the analysis script available in a replication package [27].
The detailed coding taxonomy is sketched in Table 1 and discussed in the following.
Programming Language: while DL is in principle not dependent on a specific programming language, concrete feature extraction techniques for code representation need to be built specifically for individual programming languages. In our study, C, C++, C#, Java, JavaScript, and Python have emerged as target programming languages.
Code-Level Granularity: programme code can fundamentally be represented on different levels in a code representation approach. In our study, we distinguish between approaches that consider methods, functions, or similar as atomic unit [5,28], from those that attempt to represent the programme code on a statement level [29,30].
Code Representation: as the main target of this research, different code representation approaches were distinguished on a fine-grained level. We distinguish between token-based, tree-based, graph-based, and other approaches. For token-based approaches, word embedding and n-grams [31] have emerged as clearly distinct groups. The only tree-based approaches [32] in our dataset are based on the abstract syntax tree (AST). For graph-based approaches [33], we distinguish between CFG-, DFG-, PDG-, and CallFG-based approaches, the latter capturing the relation between a statement that calls a function and the called function [24]. Other code representation approaches that do not fall clearly into these groups are bytecode, ASCII, code gadget, latent semantic indexing, and binary visualisation; each of these approaches has appeared only once in the retrieved list of papers. More examples for each approach will be mentioned as part of the discussion of results in Section 8.2.
DL Models: the main DL models that emerged in our coding as common methods in software engineering research are ANN [34], Convolutional Neural Network (CNN) [35], Recurrent Neural Network (RNN) [36], Graph Neural Network (GNN) [37], Long Short-Term Memory (LSTM) [38], and autoencoder and attention mechanism [39]. Additionally, three further models [deep belief network (DBN), neural machine translation (NMT), and deep reinforcement learning (RL)] emerged in two, four, and one publications, respectively, and we combine those in the group 'Others'. It is worth mentioning that we distinguished LSTM from RNN and listed it as a separate type (not counted when referring to RNN) since there are frameworks that combine the AST with LSTM, referred to as Tree-LSTM [23], and other frameworks that combine the AST with RNN, referred to as RvNN [22].

Software Engineering Tasks: to identify for which tasks code representation gets used, we also extracted the one or multiple software engineering tasks from the papers in the dataset. We observed that many common fields of study within software engineering were present. Particularly, we observed works related to code clone detection, code similarity detection [4], programme repair, programme generation [40], vulnerability detection [41], source code classification [1], bug detection [42], code summarisation [43], identifier generation [44], and code search [45]. Other tasks that emerged, but were investigated less frequently, were related to fixing formatting [46], traceability [47], compiler analysis [24], programme synthesis [10], malicious behaviour detection [48], performance prediction [49], code smell detection [50], and error handling [51].

| Data validation
To conduct a preliminary validation of the completeness of our dataset, we selected five recent studies from high-profile software engineering venues that applied machine learning to one of the tasks in our study (see Table 1). We checked each reference cited by these recent studies against our inclusion criteria and, for each reference matching our criteria, validated whether it was indeed contained in our study set. No publications were found to be missing.

| Threats to validity
Despite following a well-defined methodology, a review study such as ours is always subject to limitations and threats to validity. We use the classification proposed by Ampatzoglou et al. [52] to contextualise these threats.

• Construction of the Search Process and Generalisability: We chose to construct our dataset based on an initial search on Google Scholar followed by extensive snowballing, rather than a more conventional search strategy using major digital libraries, such as Scopus, IEEE Xplore, ScienceDirect, or the ACM Digital Library. We argue that relying on snowballing leads to a more complete and comprehensive dataset than traditional search, which suffers from limitations due to inconsistent naming and terminology. However, one challenge is that it is hard to conduct an identical replication of our study since Google Scholar personalises search results. To mitigate this threat, we provide a replication package that includes all studied manuscripts as well as our resulting coding sheet.
• Study Inclusion/Exclusion Bias: DL is a rapidly growing area of research within software engineering. Hence, we needed to decide when to stop accepting newly appearing papers into our dataset. While we do not believe that the overall findings would have been impacted if we had collected studies for a longer period of time, readers should still keep our data collection period in mind when interpreting our results.
• Validity of Primary Studies: Four studies in our dataset are preprints retrieved from arXiv. While those are not peer-reviewed, the included studies are highly cited and highly influential in our field. Hence, we consider it important to include them in the analysis despite the threat introduced by the lack of peer review.
• Data Extraction Bias: While many of our coding dimensions lend themselves to objective categorisation, judgement calls still needed to be made in some cases. In these cases, we discussed among the author group to reach a consensus decision.

| AN OVERVIEW OF USAGE OF DEEP LEARNING IN SE TASKS
This section allows us to establish a general "process" overview of the steps required to make DL work in software engineering. While it is not expected that this general framework will differ drastically from DL in other domains, it will allow us to put the rest of the survey in context, identify the place of code representation in this general process, and serve as a guiding rail for novices to the domain. Thus, we provide a general framework of code representation and DL models' usage for tasks in software engineering based on the reviewed studies. This model has emerged from qualitatively investigating the DL models of the studies in our dataset.

| High-level process
The resulting model is depicted in Figure 6. Unsurprisingly, the high-level architecture is comparable to the usage of DL in other domains and consists of the well-known phases of data collection, data preparation and preprocessing, as well as learning and validation.
Data Collection: The process starts with data collection, which in the domain of software engineering typically entails collecting the source code files for a specific programming language (e.g., through repository mining). Subsequently, the dataset needs to be annotated to serve as a training set. The annotation process is custom to the specific software engineering task that is intended to be tackled; for example, bug prediction evidently requires different annotations than code clone detection. Either the dataset is already pre-annotated by domain experts, or the researchers conducting the study annotate the source code themselves. Annotations are task-specific and may, for example, include information about the presence of bugs, or whether two code files are to be considered code clones [53].
Data Preparation and Preprocessing: Afterwards, in the data preparation and preprocessing phase, the collected code must be represented in a form that is compatible with DL. This is where code representation, the main subject of our study, comes into play. For example, in an AST representation, the collected code is converted into a tree form; the tree paths then need to be encoded or embedded as numeric values (vectorisation) using approaches such as one-hot encoding or word embedding. In a graph-based representation, by contrast, a variety of graph embedding techniques are used, such as Graph2vec [54], HOPE [55], SDNE [56], or Node2vec [57]. Features can then be extracted from those vectors through different approaches, such as convolutional or sequential neural networks.
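As a minimal illustration of the vectorisation step, the sketch below maps a small node-type vocabulary to one-hot vectors and to dense embeddings via a lookup table; the vocabulary and dimensions are arbitrary choices for illustration.

```python
import numpy as np

vocab = ["Module", "FunctionDef", "Assign", "Name", "Call"]
index = {t: i for i, t in enumerate(vocab)}

def one_hot(token: str) -> np.ndarray:
    """Sparse encoding: a single 1 at the token's vocabulary index."""
    v = np.zeros(len(vocab))
    v[index[token]] = 1.0
    return v

# A trainable embedding is a (vocab_size, dim) matrix; multiplying a
# one-hot vector with it reduces to a row lookup.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 8))

sequence = ["FunctionDef", "Assign", "Call"]
vectors = np.stack([embedding[index[t]] for t in sequence])  # shape (3, 8)
```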
Learning and Validation Phase: Finally, the DL model will be trained and validated based on the tackled software engineering task.

| Examples
To concretise this process, we now present two examples of publications that follow the framework shown in Figure 6. The first example [58] proposes a framework for information-retrieval-based neural source code summarisation. The solution specifically makes use of an attention encoder-decoder model. Figure 7 depicts the approach using the model introduced previously.
After collecting training data as a first step, source code is represented as ASTs, which are then turned into syntactic token sequences by tree traversal. Then, a trained encoder based on LSTM units is used to embed the code into a semantic vector using pooling, which progressively reduces the spatial size of the representation to decrease the number of parameters and the amount of computation in the network while preserving the most important features. Afterwards, a bidirectional LSTM decoder is used to capture the semantic context and generate natural-language summaries. The motivation behind this solution is that recent neural network models tend to prefer high-frequency words in the corpus while struggling with low-frequency ones. The proposed method takes advantage of both neural and retrieval-based techniques to alleviate this problem.
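The encoder-attention-decoder pattern described above can be sketched as follows in PyTorch. This is a generic, simplified sketch of the pattern (unidirectional LSTMs, dot-product attention), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqAttention(nn.Module):
    """Minimal LSTM encoder-decoder with dot-product attention."""

    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(2 * dim, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        enc_states, hidden = self.encoder(self.src_emb(src))
        dec_states, _ = self.decoder(self.tgt_emb(tgt), hidden)
        # Each decoder state attends over all encoder states.
        scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
        context = torch.bmm(F.softmax(scores, dim=-1), enc_states)
        return self.out(torch.cat([dec_states, context], dim=-1))

model = Seq2SeqAttention(src_vocab=5000, tgt_vocab=8000)
src = torch.randint(0, 5000, (2, 20))  # AST token sequences (batch of 2)
tgt = torch.randint(0, 8000, (2, 10))  # summary tokens (teacher forcing)
logits = model(src, tgt)               # shape (2, 10, 8000)
```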
Projecting Figure 7 onto the generic process in Figure 6, the code fragment part maps to data collection, while the steps from the AST to the semantic vector map to data preparation and preprocessing. The attention part, along with the bidirectional LSTM decoder, represents the learning and validation phase.

FIGURE 7: An example framework for code summarisation, based on Zhang et al. [58]

A second example [59] uses code representation and DL for code clone detection. In this work, as shown in Figure 8, the authors treat the AST as a graph by building a flow-augmented abstract syntax tree (FA-AST) as a graph representation for code fragments. This is done by adding edges representing control and data flow to the AST. A graph representation is applied here because AST-based approaches cannot fully leverage the structural information of code fragments, especially semantic information such as the control and data flow. After representing the AST as a graph, the vectors of AST nodes are pooled into a graph-level vector representation. Hence, two different types of graph neural networks (GNNs) are used: a gated graph neural network (GGNN) for graph embedding and a graph matching network (GMN), which can jointly learn embeddings for a pair of graphs.

FIGURE 8: An example framework for code clone detection, based on Wang et al. [59]
When mapping the approach explained in Figure 8 to the common architecture in Figure 6, the code fragment is part of data collection, while going from the AST to the GGNN represents data preparation and preprocessing. Finally, the GMN is part of the learning and validation phase.

| MAIN ATTRIBUTES ANALYSIS
In this section, we will answer RQ1 by exploring this study's three main attributes in isolation. We answer the question about which software engineering tasks are tackled by the studies in our dataset, and what code representation and DL approaches are being used to do so.

| Software engineering tasks
DL is used for a large variety of different tasks in software engineering. Hence, to answer RQ1.1, we cluster the tasks into four broad groups, inspired by work from Microsoft, based on high-level techniques and goals. Groups and concrete tasks, as well as their absolute and relative frequencies in our dataset, are shown in Table 2. It should be noted that the sum of percentages does not add up to 100% as some publications tackle multiple problems simultaneously.
Code-Code: the model's input is code, and the model's output is also source code (e.g., complete programmes or code snippets). Examples of tasks clustered under code-code are clone and similarity detection, code completion, and programme generation and repair. Less frequent code-code tasks in our dataset (grouped as "other") are fixing formatting, traceability, and compiler analysis. Code-code tasks are a natural fit for DL and hence a frequent target in our dataset, representing 46% of all studies. Code clone detection is the most frequent individual task, followed by the (very related) task of code similarity detection and programme repair.
Code-Text: the input of the learning model is code, whereas the output is (often natural-language) text. A canonical example of this type of task is code summarisation, where the goal is to produce natural-language summaries of source code constructs. The only other code-text task we found is identifier generation, which includes suggesting method or variable names based on code information. As a group, code-text approaches represent about 20% of the studies in our dataset. However, this is primarily due to code summarisation individually being a common area of interest in DL for software engineering (representing 15% of the studies). Identifier generation appears in seven studies. The total count of papers that tackle code-text tasks is 21, as one study [60] addresses both summarisation and identifier generation.
Text-Code: this group is the opposite of the previous one: the input is natural-language text and the output is code. The only two tasks of this type in our dataset are code search and programme synthesis. Code search uses a query text to find the corresponding source code; this task represents about 5% of the dataset. Programme synthesis, on the other hand, takes free-text descriptions of programme functions as input and returns source code as output. There is only one study in our dataset that tackles this task.
Code-Prediction: finally, DL can be used to predict qualities based on code, such as detecting vulnerabilities, bugs, or malicious behaviour. We also group source code classification in this category. Two studies are grouped as "other" in this group: error handling and code smell detection. As a group, code-prediction is quite prevalent, accounting for 37.8% of the studies in our dataset. Within this group, different tasks are well distributed with the most common one being bug detection (14%) followed by vulnerability detection (11%).
We present the complete mapping of papers to our taxonomy of SE tasks in Appendix C.

| Deep learning models
We now present the DL models used in the retrieved studies, as per RQ1.2. Various DL models have been identified in software engineering research. A graphical overview is given in Figure 9. LSTM [38], which is a type of RNN, is the most used DL approach and is found in 49 (48%) studies. LSTM copes with the "vanishing gradients" problem of RNNs by adding the mechanism of "cell states" to selectively remember, or forget, parts of the information that are needed during training [61]. Attention mechanisms [39] and CNN [62] are the second and third most used DL models with 35 and 28 publications, respectively. CNNs are particularly efficient since they can work in parallel on sequences and have a structure for which the output and input have a logarithmic distance in terms of layers, which is linear for RNNs and LSTMs. The use of CNNs together with an attention mechanism (specifically "self-attention") defines the architecture of 'Transformers'. Autoencoders [63] and RNNs are almost equally present. The least applied DL models in our dataset are ANN and GNN. The category 'Other' includes deep belief networks, neural machine translation, and reinforcement learning. It should be noted that the counts in Figure 9 add up to substantially more than the total number of studies in our dataset (103) as many papers in practice combine multiple DL models. Particularly, we observe that there are specific DL models that are commonly used together for solving specific downstream tasks, such as studies that use attention mechanisms. The attention mechanism emerged as an improvement over neural machine translation systems based on encoder-decoder RNNs/LSTMs, where both encoder and decoder are stacks of LSTM/RNN units. Further, hybrid DL models are commonly used for tasks in the code-text or text-code groups as these require different models for different input and output. These issues will be discussed in more detail in Section 6.1.

RQ 1.2 Summary: Software engineering research uses a wide variety of DL models, with LSTM and attention mechanisms currently receiving the most attention.

| Source code representation
We now turn towards the representations that are being used in conjunction with these DL models to answer RQ1.3. We analysed the source code representation approaches that are utilised to encode source code into a form that is meaningful and can be fed into DL models. Three primary (groups of) techniques have emerged from our analysis: token-based representation, tree-based representation, and graph-based representation. Five concrete representation approaches emerged that do not clearly belong to any of these groups and have hence been categorised as 'Other'. These are code gadget (a number of lines of code that are semantically related to each other [64]), binary visualisation (the raw representation of any type of file stored in the file system, which exhibits similar behaviours of the code while being syntactically different [65]), ASCII (used by Wang et al. [66] to convert each letter of JavaScript code into its eight-bit binary code), latent semantic indexing (LSI, a method of analysing a set of documents in order to discover statistical co-occurrences of words that appear together, which then give insights into the topics of those words and documents [47]), and bytecode (in this representation, a code fragment is expressed as a stream of bytecode mnemonic opcodes forming the compiled code [67]). An overview of the prevalence of the four groups is given in Figure 10. All three primary groups see frequent use in software engineering. Tree-based and token-based representations are most common and are both utilised in over half of the studies in our dataset (66 studies or 64%, and 54 studies or 52%, respectively). As before, some studies employ multiple representation approaches simultaneously. Graph-based approaches are less common and only used in 25 (24%) studies, but their usage is increasing. The remaining techniques are only used in five individual publications.
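For instance, the ASCII-style representation mentioned above can be illustrated with a short sketch (illustrative only; the exact encoding used in the original work may differ):

```python
def to_bits(code: str) -> list[str]:
    """Each character as its 8-bit binary ASCII code."""
    return [format(ord(c), "08b") for c in code]

print(to_bits("var"))  # ['01110110', '01100001', '01110010']
```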
For tree-based representation, the only specific technique that emerged from our study is the AST. However, both token-based and graph-based representations can be split up into further subcategories. For token-based approaches, these are word embedding and n-grams, with word embedding being the dominant technique (used in 37 studies, or 79% of the studies using a token-based approach; see also Figure 10).
There is a larger number of graph-based representation choices, which are depicted in Figure 10. The most common one is the CFG (17 studies, or 45%). Other options include PDG, DFG, and CallFG.

| Alternative representation approaches
In contrast, some studies have made use of code representation approaches without directly adopting any of the methods categorised in Table 1. To take token-based approaches as an example, some works have tokenised the text without using word embedding or n-gram techniques. In a study by Fernandes et al. [68], the proposed framework breaks up all identifier tokens (i.e., variables, methods, classes, etc.) of the source code into sub-tokens by splitting them according to specific heuristics (camelCase and pascal_case). Gupta et al. [69] use an encoding map for each programme to map every token, based on its type (such as function, literal, variable, etc.), to a unique name in a pool of names. Similarly, there is a subset of graph-based solutions that have not used any of the graph-based methods classified in Figure 10. Yasunaga and Liang [70] have proposed a programme-feedback graph to model the reasoning process and capture the semantic correspondence involved in programme repair. Similarly, Fernandes et al. [71] extend sequence encoders with a graph neural network that can reason about long-distance relationships. Finally, Brockschmidt et al. [72] decode code in a graph representation using a GNN for partial programmes to incorporate rich semantic information that is useful in programme repair tasks.

| Code representation depending on code-level granularity
Another question our review can answer is whether different code representation approaches are more commonly used to handle code on the statement or method levels. The results of this analysis are shown in Table 3.
As we can see, there is no clear-cut difference in the usage of representation approaches depending on the code level. However, token-based approaches are slightly more commonly used in studies that work with code at a statement level. This intuitively makes sense as such studies are less concerned with preserving the syntactical or semantic context of a software project.

| Code representation for different programming languages
As a final exploration of code representation approaches, we map which programming languages the studies in our corpus target. This is shown in Figure 11. Unsurprisingly, Java is by far the most commonly considered programming language and is considered in over half the studies in the corpus (58 studies, or 56%). This can be explained by the wide availability of parsing tools that parse Java code into ASTs, which is consistent with the findings in Figure 10 that show the AST to be the most common representation approach. Examples of common Java parsers are the Eclipse Java development tools (JDT) used by Büch and Andrzejak [73], srcML [74] used by Bui et al. [4], or JavaParser used by Alon et al. [75]. Notably, srcML is a universal AST system that uses the same AST representation for multiple languages (Java, C#, C++, and C). For graph-based representation approaches, different tooling is required. For example, Ben-Nun et al. [48] convert code to statements in an Intermediate Representation (IR) using the LLVM Compiler Infrastructure [76], which is then processed into contextual flow graphs. Mehrotra et al. [6] use the Soot optimisation framework [77] to build programme dependence graphs for Java code, followed by Cytron's method [78] to compute control dependence. Reaching definition analysis [79] and upward exposed analysis [80] are both used for computing data dependence graphs.

| DETAILED ANALYSIS BASED ON SOFTWARE ENGINEERING TASKS
So far, our analysis discussed the three main dimensions of the study (tasks, DL models, and code representation approaches) in isolation. Now, we turn to investigating the interplay between these dimensions as part of RQ2. Particularly, we investigate how DL models and chosen representation depend on tasks (Sections 6.1 and 6.2, respectively).

| Software tasks and DL models
In this section, we discuss the results for RQ2.1, where we map the DL models chosen to tackle software engineering tasks. Figure 12 depicts a mapping of specific DL models identified in the study to the four high-level categories of tasks as a bubble plot. We observe that a wide variety of models have been applied to the tasks in the code-code group, whereas there appear to be more dominant methods for code-text (LSTM with autoencoders and attention mechanisms) as well as code-prediction (CNN and LSTM). The data for the text-code group are too sparse to come to a clear conclusion, but initial evidence suggests that researchers also use a variety of models for this task. Further, LSTM is commonly used and proportionally distributed across all types of tasks. However, CNN is most frequently used for tasks in the code-prediction group. Both autoencoders and attention mechanisms are used frequently for code-code and code-text tasks, but rarely for other tasks. Figure 13 drills deeper into this and depicts the usage of different DL models for specific tasks in the code-code group. We observe that a variety of models are used for all specific tasks.
In programme repair, some approaches use sequence-to-sequence networks with encoder-decoder models attached with attention mechanisms. Bidirectional LSTMs are mainly used in both encoder and decoder, while attention might be used in the decoder part [40] or in the encoder [70]. Other approaches for handling programme repair do not rely on the encoder-decoder model. For example, Vasic et al. [81] use LSTM and an attention mechanism to locate and handle the misuse of variables defined in the programme. Other studies rely on sequential models for handling programme repair without using the encoder-decoder attention model [69,82], whereas Dinella et al. [83] rely on graph neural networks to learn graph transformations that repair bugs in JavaScript programmes. Figure 14 presents a similar analysis for specific code-text tasks. It becomes evident that autoencoders are an important facet of contemporary code summarisation research. These approaches are based on the sequence-to-sequence paradigm over the words of some text, with a sequence encoder (typically an RNN, but sometimes using self-attention [12]) processing the input and a sequence decoder generating the output. Recent successful implementations of this paradigm have substantially improved performance by focussing on the decoder, extending it with an attention mechanism over the input sequence and copying facilities [68]. However, while standard encoders (e.g., LSTMs) can in theory handle arbitrary long-distance relationships, in practice they often fail to handle long texts (summarisation output) correctly [84].

| Software tasks and code representation
We now turn towards RQ2.2 and explore how the choice of code representation approach is impacted by the chosen software engineering task. An overview for the four groups of tasks is provided in Figure 15. We observe that the various code representation approaches are used across software engineering tasks. Text-code tasks are commonly addressed using token-based approaches. Only one study uses a tree-based approach for this type of task [10], and none uses a graph-based approach. However, that study handles multiple tasks within the same framework. More specifically, the authors have built multiple representations to handle the tasks separately: the tree-based approach addresses code summarisation (a code-text task), whereas a token-based approach is used for code retrieval (text-code). Hence, we conclude that for text-code tasks, for example, code search, a token-based representation is the only method that is seeing current use. This can be explained as the free-form text of, for example, a query is better treated using natural language processing (NLP) techniques than the more code-specific tree- and graph-based representations.
Graph-based approaches are most commonly used in code-code tasks. However, 38% of the graph-based approaches are also used for code-prediction tasks. To better understand this observation, we have again drilled down into specific tasks. In Figure 16, we present how often specific tasks in the code-code group use a graph-based approach to represent the source code.

FIGURE 13: Applied DL approaches for specific code-code tasks
FIGURE 14: Applied DL approaches for specific code-text tasks
FIGURE 15: Code representation approaches per group of software engineering tasks
Both code clone and code similarity detection are proportionally overrepresented here. This is interesting, especially since these tasks have many similarities. It can be argued that a graph representation is highly appropriate for identifying similar code elements: code snippets are represented as graphs, and those graphs are embedded into vectors (one vector per graph). To measure similarity, one can then simply compute the distance between those vectors. This approach is arguably simpler and more effective than breaking each piece of code into tokens and then embedding each token into a vector.
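A sketch of this idea is shown below, assuming each graph's node embeddings come from some upstream model; mean pooling and cosine similarity are the simplest possible choices here, not those of any specific study.

```python
import numpy as np

def graph_vector(node_embeddings: np.ndarray) -> np.ndarray:
    """Pool per-node embeddings (n_nodes, dim) into one graph vector."""
    return node_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
g1 = rng.normal(size=(12, 64))  # stand-in node embeddings, graph 1
g2 = rng.normal(size=(9, 64))   # stand-in node embeddings, graph 2
score = cosine_similarity(graph_vector(g1), graph_vector(g2))
# Pairs scoring above a tuned threshold would be flagged as clones.
```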
We now conduct a similar analysis for the usage of graph-based representations in code-prediction tasks (Figure 17). We observe that graph-based representation approaches are commonly utilised in vulnerability and bug detection, together amounting to about two-thirds of all usage of graph-based representation in code-prediction tasks. For these approaches, researchers commonly need to preserve semantic information, for which graph representations are most suitable.

| MAIN ATTRIBUTES - CROSS-ANALYSIS
We now discuss the interplay of all three dimensions of this study (tasks, DL models, and code representation approaches), answering RQ3. In the previous sections, we have separately analysed the three dimensions task, DL model, and code representation approach. To answer RQ3 and get deeper insights into the current trends in the field, we now investigate all three dimensions together. The results of this analysis are summarised in Table 4. LSTM is the most commonly used model for code-code tasks, using both tree- and token-based representations as well as, to a certain extent, autoencoders. In contrast, LSTM and autoencoders are almost equally frequently used for code-text tasks. LSTM and autoencoders go hand in hand in solving sequential problems by treating the code as a sequence of tokens (in a token-based representation) or a sequence of nodes (in a tree-based representation). Hence, sequential models, such as LSTM, are the most appropriate approach for such problems. The sequential model needs to be encapsulated in an encoder-decoder model because, for a code-text task, it is necessary to encode the code consistently through one model in order to generate natural-language sequences from the corresponding source code. Attention mechanisms are used to dynamically select the distribution over the combined representations while decoding, or to select the relevant paths in the AST while encoding [85].
Unsurprisingly, GNN is the most commonly used architecture in conjunction with a graph representation in the majority of SE tasks. In contrast, no common DL models can be identified for text-code tasks across all the representations. Instead, various different models are used across the studies in our dataset. This is because the text that represents the input in a text-code task can be treated using natural language processing (NLP) techniques, which, according to the literature, all DL models handle well.
As for code-prediction tasks, CNN is the most dominant model in conjunction with a tree-based representation, while LSTM is most commonly used with a token-based representation. This difference is rooted in the different goals underlying tasks in the code-prediction group: in these tasks, the goal is not to generate code, as in code-code and text-code tasks, or to generate text, as in code-text tasks. Rather, code-prediction tasks tend to deal with classical DL prediction problems, that is, classification and regression. For instance, bug or vulnerability detection is a binary classification problem deciding whether or not the code includes a bug or vulnerability. The same is true for performance prediction, where a specific performance value is predicted as a regression problem.

RQ 3 Summary
We analysed the retrieved frameworks from the viewpoint of the three main dimensions of our study: software task, code representation approach, and applied deep learning model. LSTM and autoencoders are the most used DL models for code-code and code-text tasks, using tree-based and token-based representations, while GNN is the most used model with graph representations across most SE tasks. For code-prediction, CNN with a tree-based representation and LSTM with a token-based representation are the most common techniques used in the studies.

| ANALYSIS OF HYBRID APPROACHES
In this section, we answer RQ4 by exploring frameworks that address multiple SE tasks or that use multiple representations. We refer to such studies as using a hybrid approach.

| Hybrid software tasks within one framework
In this section, we address RQ4.1 and identify the characteristics and main properties of frameworks that solve multiple SE tasks simultaneously. No study in our dataset is general in the sense that it is able to address all SE tasks.
| Solving many tasks with one framework
Bui et al. [10] propose an approach that integrates three different tasks: it tackles code similarity detection as a code-code task, code search as a text-code task, and code summarisation as a code-text task. This study is singular in that it combines text-code tasks with other SE tasks. The proposed method is a self-supervised learning framework for source code modelling designed to mitigate the need for labelled data for different SE tasks. The key innovation here is that the source code model is trained to detect similarity and dissimilarity across code snippets. This study also makes use of a hybrid representation approach, merging an AST-based strategy with a token-based approach. The representation approaches are used in the encoder component of the discussed system. Hence, well-known AST-based code modelling techniques, such as Code2vec [44] and TBCNN [3], are used alongside token-based approaches that handle the source code as a sequence of tokens using a neural machine translation (NMT) baseline. These techniques utilise node type and token information to initialise AST nodes. The hybrid representation approach will be discussed in more detail in Section 8.2. Within this approach, various encoders are used, and the choice of encoder depends on the task.
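The following small sketch illustrates the general idea of such a hybrid representation (not the actual pipeline of Bui et al. [10]): the same snippet yields a lexical, token-based view and a syntactic, AST-based view, which a hybrid framework would feed to separate encoders. It uses only the Python standard library.

```python
# Two orthogonal views of the same snippet, using the standard library.
import ast
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# Lexical view: the raw token stream (token-based representation).
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]

# Syntactic view: node types from an explicit preorder walk of the AST
# (tree-based representation).
def preorder(node):
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

node_types = list(preorder(ast.parse(source)))

print(tokens)      # ['def', 'add', '(', 'a', ',', 'b', ')', ':', ...]
print(node_types)  # ['Module', 'FunctionDef', 'arguments', 'arg', ...]
```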

| Frameworks that solve two tasks
Besides the aforementioned study, we find that three other approaches tackle combinations of code-code and code-text tasks. Cvitkovic et al. [60] design a framework that solves code completion as a code-code task and identifier generation as a code-text task. They use ASTs to represent the source code. This tree is augmented with semantic information, such as data- and control-flow, to eventually obtain an augmented AST in the form of a directed multigraph. The augmented AST is then further extended by adding a Graph Structured Cache: a node is added to the augmented AST for each token in the input instance. All nodes are then vectorised to be processed with a graph neural network. Kang et al. [86] evaluate the generalisability of the Code2vec modelling technique by applying it, along with a sequential model, to address code clone detection as a code-code task as well as code summarisation as a code-text task. They then compare the results obtained from these techniques with task-specific baselines. In this study, the authors do not focus on the overall effectiveness of the methods. Instead, they evaluate whether the use of Code2vec can improve the performance of the baselines. Based on their results, the authors claim that no improvements were achieved by applying Code2vec.
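The sketch below illustrates the Graph Structured Cache idea in a toy form (assuming the networkx library; the graph is hand-made and not the construction of Cvitkovic et al. [60]): one extra node per token is linked to every occurrence of that token, so that distant usages of the same name become graph neighbours.

```python
# Toy Graph Structured Cache on top of a hand-made "AST" graph.
import networkx as nx

g = nx.MultiDiGraph()
# (node_id, token) pairs standing in for AST leaves.
leaves = [(1, "x"), (2, "y"), (3, "x")]
g.add_edges_from([(0, 1), (0, 2), (0, 3)], kind="ast-child")

for node_id, token in leaves:
    cache = f"tok:{token}"                 # one shared cache node per token
    g.add_edge(cache, node_id, kind="cache-occurrence")

# Both occurrences of "x" now share a neighbour, giving a GNN a two-hop
# path between them regardless of their distance in the AST.
print(list(g.successors("tok:x")))         # [1, 3]
```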
Code summarisation is also investigated through a framework proposed by Wei et al. [87] that is generalised to solve programme synthesis as a text-code task. They use a token-based approach for code representation. The proposed framework consists of three main parts: a code summarisation model, a programme synthesis model, and dual constraints. The code summarisation and programme synthesis models both rely on a sequence-to-sequence neural network, an encoder-decoder with an attention mechanism between encoder and decoder. To leverage the contextual information within the word embedding, a token-based, bi-directional LSTM is used as a unit in the encoder; another LSTM is used in the decoder. The dual constraints are implemented by adding regularisation terms to the loss function that constrain the duality between the two models, inspired by the probabilistic correlation and the symmetry of attention weights between the code summarisation and programme synthesis models.
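A hedged sketch of such a dual constraint is given below: the regularisation term penalises disagreement between the two factorisations of the joint probability of code and text. The function signature and the weight lam are illustrative assumptions, and the tensors are assumed to be log-probabilities computed elsewhere; this is not the exact loss of Wei et al. [87].

```python
# Joint training loss with a probabilistic duality regulariser:
# P(code) * P(text|code) should equal P(text) * P(code|text).
import torch

def dual_loss(loss_summ, loss_synth,
              logp_code, logp_text,           # language-model priors
              logp_text_given_code,           # from the summarisation model
              logp_code_given_text,           # from the synthesis model
              lam=0.1):
    # Duality residual: both factorisations of the joint should agree
    # in log space; the squared residual is added as a penalty.
    residual = (logp_code + logp_text_given_code
                - logp_text - logp_code_given_text)
    return loss_summ + loss_synth + lam * (residual ** 2).mean()
```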
Finally, four studies design solutions that are transferable across code-code and code-prediction tasks [21,[81][82][83]. Three of these propose frameworks that tackle programme repair as a code-code task and bug detection as a code-prediction task. These tasks are related in the sense that a bug is first detected in the code, which is subsequently fixed through programme repair. Hence, it makes sense to have one solution that addresses these tasks simultaneously. In the same context, one of these studies [83] uses a hybrid code representation approach combining tree- and graph-based approaches. The code is parsed into an AST to capture the programme's syntactic structure; then, the leaf nodes are connected with SuccToken edges. Additionally, value nodes that store the content of the leaf nodes are added, linked by special semantic ValueLink edges. According to the study, the ultimate aim of introducing this additional set of nodes is to provide a name-independent strategy for code representation and modification. After representing the programme as a graph, a GNN is used to map the graph into a fixed-dimension vector space. An LSTM is then trained to locate the bug through a sequence of graph transformations. That means that, given a buggy programme modelled as a graph, the proposed framework makes a sequence of predictions, including the position of bug nodes and the corresponding graph edits that produce a fix.
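The following minimal message-passing sketch (in PyTorch, with illustrative sizes and a dense adjacency matrix) shows how a GNN of this kind can map a programme graph to a fixed-size vector that a downstream sequential model can consume. It is a generic sketch, not the architecture of [83].

```python
# Generic message passing: aggregate neighbour messages, update node
# states, then read out a single fixed-size vector for the whole graph.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    def __init__(self, dim=128, steps=4):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)
        self.steps = steps

    def forward(self, node_feats, adj):
        # node_feats: (N, D) node features; adj: (N, N) adjacency matrix.
        h = node_feats
        for _ in range(self.steps):
            m = adj @ self.msg(h)    # aggregate messages from neighbours
            h = self.upd(m, h)       # GRU-style node state update
        return h.mean(dim=0)         # readout: one fixed-size graph vector
```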
The other related approaches [81,82] use only a token-based approach combined with LSTM to locate and repair the bug in the programme. The fourth study in this group [21] defines an AST-based neural network for source code representation in order to solve code clone detection as a code-code task and code classification as a code-prediction task. This study discusses the problem of the large depth of the AST, which causes long dependencies between the sequence of nodes, leading to vanishing gradient problems when fed into a sequential model. Thus, the tree is divided into a sequence of small statement trees, constructed using a preorder traversal algorithm. These trees are encoded and used with a bidirectional RNN model, leveraging the naturalness of statements to achieve the tasks.
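The sketch below approximates the statement-tree idea using Python's ast module: instead of encoding one deep tree, the code is cut into small per-statement subtrees. This is a deliberate simplification; the actual construction in [21] splits and encodes the trees more carefully.

```python
# Approximate statement-tree splitting: one small subtree per statement
# node. A real implementation would deduplicate nested statements and
# encode each subtree before feeding the sequence to a bidirectional RNN.
import ast

def statement_trees(source):
    """Yield a dump of every statement subtree in the parsed code."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            yield ast.dump(node)  # stand-in for a per-statement encoding

for tree in statement_trees("x = 1\nif x:\n    print(x)\n"):
    print(tree[:60])
```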
It is interesting to observe that no study in our dataset proposes a framework that addresses a combination of code-text and code-prediction tasks, nor a combination of code-prediction and text-code tasks.

RQ 4.1 Summary
The integration of multiple tasks within one framework relies on the relatedness of these tasks. However, there currently appears to be no truly general framework for DL in software engineering that could be applied independently of the tackled software tasks.

| Hybrid representation approaches
Some studies have utilised a hybrid approach to code representation in order to capture more information about the source code. This is often promising, as tree-based approaches capture syntactical information, graph-based approaches are better at retaining semantics, and token-based approaches preserve lexical information. Table 5 summarises how often different types of code representation approaches are used alone or in conjunction. The diagonal elements represent the frequency of frameworks that use a single representation approach, while the off-diagonal elements represent the frequency of frameworks that use hybrid representations. Seven studies [5,7,21,42,67,88,89] combine representations from all three groups. The most common hybrid approach is a combination of token- and tree-based approaches, used by 25 studies in total, or almost a fourth of our dataset (note that 18 approaches combine only tree- and token-based representations, plus the seven studies that use all three). Combinations of tree- and graph-based approaches are also fairly popular, used by 16 studies in total.
Particularly interesting are the seven studies that have used all three representation approaches in conjunction. For example, Hua et al. [7] and Li et al. [42] present work on bug detection. The two approaches start by constructing AST representations of the source code in order to locate sensitive points, such as object constructions, method invocations, expression statements, conditional statements, and loop statements. Sensitive points are the syntactic locations where most 'simple' bugs manifest. Then, Word2Vec [20], a token-based representation approach, is employed, taking all AST nodes of a method as input and generating a learned vector representation for each given AST node. This vector representation is later used as input to the DL model. The local context of the method representation is preserved by representing each AST path as an ordered set of node vectors. Since a bug can involve multiple methods, it is also crucial to capture the global context by modelling the relations between different methods through a programme dependence graph (PDG). Thus, semantic information in the source code, such as data and control flow, is traced. Once the graphs are generated, different graph embedding techniques are applied to nodes, edges, or the entire graph. For example, Node2Vec [90] is used to vectorise the nodes.
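The two embedding steps can be sketched as follows, assuming the gensim and node2vec packages; the corpora and graph below are toy stand-ins, not data from [7] or [42].

```python
# Word2Vec over sequences of AST node labels, plus Node2Vec over a toy
# programme dependence graph (PDG) whose nodes stand in for methods.
import networkx as nx
from gensim.models import Word2Vec
from node2vec import Node2Vec

# Each "sentence" is the sequence of AST node labels for one method.
ast_corpus = [["MethodDecl", "Param", "Block", "Return", "BinOp"],
              ["MethodDecl", "Block", "If", "Call", "Return"]]
w2v = Word2Vec(sentences=ast_corpus, vector_size=64, window=5, min_count=1)
print(w2v.wv["Return"].shape)             # (64,)

# Toy PDG; in the surveyed studies the edges carry data- and
# control-flow relations between methods.
pdg = nx.DiGraph([("m1", "m2"), ("m2", "m3"), ("m1", "m3")])
n2v = Node2Vec(pdg, dimensions=64, walk_length=10, num_walks=20, quiet=True)
node_model = n2v.fit(window=5, min_count=1)
print(node_model.wv["m2"].shape)          # (64,)
```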
Similarly, other studies that use all three representation approaches tackle code clone detection [5,21,67]. These studies show that, using a stream of identifiers to represent the code, DL can effectively replace manual, hand-crafted feature engineering. Moreover, these works show that representing the code at different levels of abstraction (identifiers, AST, and CFG) can provide different, yet orthogonal, views of the same code fragment, thus enabling more reliable detection of code similarities.
Sonnekalb et al. [89] and Li et al. [42] investigate a combination of all three main representation approaches for the task of vulnerability detection. These studies argue that there is a need to represent programmes in a way that adequately accommodates the syntactic and semantic information related to vulnerabilities. This enables multiple kinds of neural networks to detect various kinds of vulnerabilities.

| GAPS IN THE LITERATURE
In this section, we will discuss perceived limitations, research gaps, and challenges that we derived from the retrieved studies, addressing RQ5.
• Lack of Topic Coverage: Even though we have found DL to be applied to a wide variety of SE tasks, some crucial tasks appear to be underrepresented. For example, we have identified only one or two studies each tackling performance prediction, code smell detection, or traceability. This is surprising, as these tasks could profit substantially from an investment in DL. Taking performance prediction as an example, performance is often seen as a crucial non-functional property of software systems, and traditional performance engineering is challenging [91] and error-prone [92]. A deeper investment in DL, in the style of some code clone detection or programme repair studies, seems promising in these domains.
• Lack of Generalisability: According to Figure 6, DL models can be used in two phases: in the data preparation and preprocessing phase for learning the representation of code (representation learning), and then again in the learning and validation phase to achieve the SE task. In principle, representation learning is independent of the tackled SE task. Transfer learning [93] could be used to generalise and reuse pre-trained representation learning models across different tasks. In other application domains of DL (such as computer vision or NLP), transfer learning has led to generally useful models such as DenseNet [94] or BERT [95]. We observe a lack of such models in software engineering. Instead, we made the observation in this study that most of the proposed approaches are highly domain- and problem-dependent. Very few retrieved studies are applied to different SE tasks, and very few solutions are transferable or easily adapted to other SE problems. There are some approaches that explicitly present generalised SE representations [44,85]. However, these approaches are for fixed code units, such as tokens, statements, or functions. They are not sufficiently flexible to generate encodings and embeddings for different units. Thus, the learned code representation may not be effective for a multitude of tasks. Two studies in our dataset already attempt to provide such a generalised representation model [4,10]. We argue that this is an important area that should be a focal point for future investigations.
• Lack of Industrial Data: Unsurprisingly, the vast majority of approaches in our dataset are trained and tested on open-source projects extracted from platforms such as GitHub. However, validation of the resulting models on industrial data is rare. This is understandable, especially for supervised learning models, which require annotated datasets of considerable size. Annotations often need to be manually labelled by humans according to a specific downstream task.
To address this challenge, and connecting to the previous point, recent research uses self-supervised learning [4,10] to mitigate the need for manually labelled data.

| DISCUSSION
• Towards AST-Based Neural Networks: As our work shows, token-based approaches are common in the software engineering literature. These approaches tend to either treat the code as a token sequence or a bag of tokens, or they rely on latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) to represent the code. The problem with these token-based approaches is that they treat source code as if it were natural language. To improve on them, code syntax and semantics need to be taken into account [96]. Some existing work [3,22,23] provides strong evidence that syntactical knowledge contributes positively and leads to better representations than traditional token-based methods. We speculate that this is the reason why ASTs are used in so many different approaches: through the AST, researchers can easily capture lexical as well as syntactical information. Hence, many research works combine ASTs with deep learning, which is referred to as AST-based neural networks. These approaches combine ASTs with recursive neural networks (RvNN) [22], tree-based CNNs [3], or tree-LSTMs [23].
• The Limitations of Tree-based Approaches: Despite the effectiveness of such tree-based neural network approaches in extracting both lexical and syntactical information, there are limitations. Similar to long texts in NLP, tree-based neural models are vulnerable to the vanishing gradient problem, where the gradient becomes vanishingly small during training (especially when the tree is very large and deep, which it often is for real-life source code). Hence, traversing and encoding the entire AST bottom-up [22,23] or using a sliding window technique [3] may lose long-term context information [21]. Another limitation of AST-based neural networks is that these approaches transform the AST into, or present it as, a full binary tree to improve simplicity and efficiency. However, this in turn destroys the original syntactic structure of the source code and makes the AST even deeper. Moreover, the transformed, deeper AST reduces the capability of neural network models to capture real and complex semantics [21]. Finally, some SE tasks require not only syntactical but also semantic information.
• Towards Graph-based Code Representation: Due to the problems of leveraging semantic information with AST-based approaches, more and more recent DL papers adopt graph-based representations, such as the control flow graph (CFG) and the data dependency graph (DDG). These representation approaches can overcome some of the limitations of AST-based neural networks. Examples of such works are Zhao et al. [25], who extract semantic features from the CFG of the represented code, Allamanis et al. [33], who consider the long-range dependencies induced by the same variable or function occurring in distant locations, and Tufano et al. [67], who directly construct CFGs of code fragments.
• The Limitations of Graph-Based Approaches: However, graph-based representation is not without challenges either. One drawback of CFGs is that they lack data flow information; see the toy example after this list. Furthermore, most CFGs only contain control flows between code blocks and exclude the low-level syntactic structure within code blocks [59]. Another drawback of CFGs is that, in some programming languages, CFGs are much harder to obtain than ASTs. Nevertheless, Henkel et al. [97] show that embeddings learned from (mainly) semantic abstractions provide nearly triple the accuracy of those learned from (mainly) syntactic abstractions.
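The toy example below (built by hand with networkx) illustrates the first of these drawbacks: a CFG records which blocks may follow which, but carries no information about which data flows between them.

```python
# Hand-made control flow graph for: if (x > 0) y = x; else y = -x;
import networkx as nx

cfg = nx.DiGraph()
cfg.add_edges_from([
    ("entry", "cond"),   # evaluate the branch condition
    ("cond", "then"),    # y = x
    ("cond", "else"),    # y = -x
    ("then", "exit"),
    ("else", "exit"),
])
# A data dependency graph would add def-use edges, e.g. recording that
# the value of y defined in "then"/"else" is used after "exit"; the CFG
# alone cannot express that relation.
print(nx.has_path(cfg, "entry", "exit"))  # True
```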
Ultimately, many solution approaches choose a syntactic representation [75], because it has been shown to be useful for modelling programming languages in machine learning models and to be more expressive than n-grams and manually designed features [44]. Other solutions use approaches based on semantic context [98], in which programme elements are graph nodes and semantic relations are edges in the graph. Due to the gap between the syntax (e.g., tokens or ASTs) and the semantics of a procedure in a programme, abstractions of traces obtained from symbolic execution of a programme have also been used as a representation for learning word embeddings [97].
Based on the aforementioned discussion, ongoing developments, and currently promising research directions, we expect a move towards more graph-based code representation, as these representation models make it easier to learn semantic information. However, graph-based approaches are not without challenges, and more research in this direction will be needed.

| CONCLUSION
This study has presented a systematic mapping study of 103 primary studies that use code representation in the context of DL for software engineering. Our mapping study has classified the software tasks into four main categories depending on the input and output of the DL model (code-code, code-prediction, code-text, and text-code). Our study showed that code-code and code-prediction are the most frequently addressed software tasks. We have also observed that tree-based and token-based approaches are the most common representation approaches applied in the investigated studies. However, we have also observed a trend towards hybrid representations (which combine multiple different representation approaches) as well as a preference for graph-based representations in newer studies. We identify two primary challenges in the current literature: (1) there is a lack of generalisability of the presented approaches to other tasks (i.e., there are few attempts at transfer learning between tasks) and (2) very few studies validate the proposed frameworks on industrial datasets. We argue that these two problems constitute severe threats to the practical usefulness of current code representation research in the field of software engineering.