Xml Orientend Gcc AST ANalyzer

Introduction

HERE YOU WILL LEARN HOW XOGASTAN WORKS


About XML technologies used

XOgastan uses the nearest technology: c++ stl, XML, linux. In fact, we decided to use the XML technology for several serious reasons:
  • XML can be used to represent simple or complex data structures.
  • XML can be used to represent data that can be shared among applications.
  • XML will be one of the key technologies of future Web applications.
  • XML is very flexible.
  • XML has reached a good level of maturity.
Currently there are a lot of api, produced by different vendors or groups, that implements the DOM. We choose the exerces-c api of the apache project. Now, we list the purposes of XOGastan API:
  • XOGastan API must take an input file with the ast.
  • XOGastan API could help to produce data that can be shared with other applications.
  • XOGastan API should use standard technologies!
So, if we analyze the previous three points we can deduce: The ast dumped by gcc doesnt use a standard format (it is a format designed by gcc designers) We translate this format in a more standardized one: gxl, which is in fact an XML dtd (document type definition). If XOGastan shares data with other applications, then it will be a good idea to use a standard language. We define some dtd to represent the structure of the output of XOgastan, and we produce output in XML oriented way.

Go to the top

About others C analizer

By habitat of a program P we mean the set of programs that are ancestry of P or has the same purpose (functionality) of P.
This is the first version of XOGastan and then we can tell that XOGastan is the base program of its genealogy. However, there are many projects that work as XOGastan, including following: gasta, cppx, DATRIX. There are lots of differences between XOgastan and the previous projects:
  • XOGastan versus DATRIX. The input file, the representation used, the resulting data are different. Datrix is designed to analyze the C/C++ source code and not the intermediate representation of an ast generated by a compiler. Datrix does not analyze an ast, but an asg that is an evolution of the ast. Moreover, Datrix is also oriented to the analysis of C++ programs.
  • XOGastan versus cppx. Both programs analyze the ast produced by gcc, but there are differences: the languages used and the purpose. Cppx is written in C. The purpose of cppx is transforming the ast of gcc into a Datrix asg, and the output is a graph represented by using the gxl language. The only point in common between XOgastan and cppx is the XML representation of the graph.
  • XOGastan versus gasta. Gasta is a software for the analysis of gcc dumped by gcc. The differences between XOgastan and gasta are: the purpose is the same, but the techniques and the technologies used to achieve the purpose are different. Gasta is written in C, XOgastan is written in C++. Both software are intended to be used by developers. XOgastan is XML oriented and gasta is not.
At the end we can tell:
"XOGastan uses the same representation used by cppx and its purpose is the same of gasta: obtaining information about a C program !

Go to the top

About the internal of XOGastan

XOGastan is an XML oriented application because its input and its output are xml files. So, XOGastan takes in input an XML representation of the ast dumped by gcc. The point is: the ast dumped by gcc is not represented using an XML format! How do we translate the format dumped by gcc into a XML format? Which XML format do we choose? After a search on the web, and some advices of an expert in this field we choose the gxl dtd. Now, when we have the initial format, the one produced by gcc, and the final format, the gxl dtd, and we can describe how the translation is made. We wrote a perl tool called gcc2gxl. It includes a perl scripts and a translation table: gcc2gxl.pl and g2x.map. The script gcc2gxl.pl takes in input the file foo.c.tu dumped by gcc and produces an output file with the ast represented in gxl format.

1. Gcc has a node to represent the function declaration, its code is function_decl. This node contains information about the status of the function declaration: static or extern memory class, the name of the source file where is declared, the number of line in the source file where the declaration is. Moreover, this node has different links: link to the node with the function name (an identifier_node), link to the first node of the body (compound_stmt), link to the next declaration in the same scope (can be any declaration node),

2. The following lines are an example of the information dumped gcc about the function_decl:

@15 function_decl
name: @29 mngl: @30
type: @31 srcp: div.c:101
chan: @32 args: @33
static body: @34

where @15 is the index (unique in the dumped unit file) of the function_decl, @29 is the index of the node with the name, and so on

3. Some of translation rules for the function_decl are:

case FUNCTION_DECL:
name:*%<edge from="index" to="*"><type xlink:href="gccast.xml#name"/></edge>
type:*%<edge from="index" to="*"><type xlink:href="gccast.xml#type"/></edge>
scpe:*%<edge from="index" to="*"><type xlink:href="gccast.xml#scope"/></edge>
srcf:*%<attr name="source_file"><string>*</string></attr>
srcl:*%<attr name="source_line"><int>*</int></attr>
artificial %<attr name="flag"><string>artificial</string></attr>
chan:*%<edge from="index" to="*"><type xlink:href="gccast.xml#next_decl"/></edge>

args:*%<edge from="index" to="*"><type xlink:href="gccast.xml#arguments"/></edge>
undefined %<attr name="flag"><string>undefined</string></attr>
extern %<attr name="flag"><string>extern</string></attr>
static %<attr name="flag"><string>static</string></attr>
body:* %<edge from="index" to="*"><type xlink:href="gccast.xml#body"/></edge>
fn:*%<edge from="index" to="*"><type xlink:href="gccast.xml#body"/></edge>

Each rule has its format and we have two different formats. In this document we do not talk about these formats. For more info see the on-line doc of XOgastan.

4. At the end, the final result of the translation will be:

<node id="15">
<type xlink:href="gccast.xml#function_decl"/>
<attr name="source_file"><string>div.c</string></attr>
<attr name="source_line"><int>101</int></attr>
<attr name="flag"><string>static</string></attr>
</node>

<edge from="15" to="29"><type xlink:href="gccast.xml#name"/></edge>
<edge from="15" to="31"><type xlink:href="gccast.xml#type"/></edge>
<edge from="15" to="32"><type xlink:href="gccast.xml#next_decl"/></edge>
<edge from="15" to="33"><type xlink:href="gccast.xml#arguments"/></edge>
<edge from="15" to="34"><type xlink:href="gccast.xml#body"/></edge>

The file g2x.map is the result of a long analysis of the ast gcc. We studied the source code, the documentation file and at the end we got a comprehensive set of translation rules. The most important files of gcc project we studied are: tree.def, tree.h, c-common.def, cptree.def, cptree.h, dump.h, dump.c, cpdump.c,

It has been a very hard work !

Go to the top

About the analisys made by XOGastan

We can tell that the analysis of XOGastan is function-oriented! In other words, XOgastan can get a lot of information about the functions present into the nast (or the ast, that is the same). This version of XOgastan helps to analyze the nast by searching function declarations and for each one of these it help to perform same further analysis:
  • Gets the name, the type returned, the parameter list ... of the function.
  • Gets some information about the statements into the function's body: pick up statistics of the statements used, build graph of the body tree produced by gcc, build a control flow graph, ...
  • Gets the list of declarations into the body of function: variable declarations, typedef declarations, ...
  • Gets statistic about the number of expressions and operators ...
  • Gets a list of the variables used into the expressions: variables with local scope, variables with global scope, parameters.
  • Builds a call-graph of the function called.
XOgastan helps to pick up also statistical information about nast: total number of nodes, frequency of a node, Now, you could ask us how XOGastan collects these data ! Simple, it uses an analysis-pattern called visitor-pattern (remember that XOgastan uses an object-oriented abstract syntax tree, and remember that C++ is a very powerful language). You can understand by reading the following situation:

"A man, with a big drawer, walks inside a big palace which consists of lots of coloured rooms (at least 2000 rooms). Each room has its colour, not all adjacent rooms have the same colour, and we have only one hundred colours ! The man searches some objects, ... and each typology of object is always present into the rooms with the same colours. During the visit the objects are taken by the man and put into the drawer ! To visit the palace the man reads some instructions written by a friend !"

Maybe it is a very stupid example but, if you think ...
  • The NAST is like a big palace.
  • A node is like a room.
  • The code of the node is like the colour of a room.
  • The visitor is like a man.
  • The single data are the objects collected by the man.
  • The structure used to hold the collected data is like the drawer.
  • The strategy of the visit is the instruction's list used by the man.
  • The programmer is the friend !!
Ok, this is the visitor-pattern used by XOgastan !

Go to the top