Download
Seahawk Files (Eclipse Plugin + Importers)
Deployment Instructions
In this tutorial we are going to install all the infrastructure needed by Seahawk to work correctly. This will require Apache Tomcat, Apache Solr, a relational database (PostgreSQL or MySQL) and, obviously, the Eclipse IDE with Seahawk's plugin jar file.
Preliminary Software Installation
Apache Tomcat and Apache Solr installation
The first step concerns the installation of the search engine: Seahawk relies on Apache Solr. To install it, first download and install Tomcat, then download and install Solr. Follow the tutorial on how to install it here.
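As a rough sketch, assuming a Unix-like machine, Tomcat and Solr 3.x unpacked under /opt, and a Solr home at /opt/solr-home (versions, paths and archive names are assumptions; the linked tutorial remains the authoritative reference):
tar -xzf apache-tomcat-7.0.x.tar.gz -C /opt
tar -xzf apache-solr-3.x.y.tgz -C /opt
# deploy the Solr web application into Tomcat (war file name depends on the Solr version)
cp /opt/apache-solr-3.x.y/dist/apache-solr-*.war /opt/apache-tomcat-7.0.x/webapps/solr.war
# tell Tomcat where the Solr home directory is
export JAVA_OPTS="-Dsolr.solr.home=/opt/solr-home"
/opt/apache-tomcat-7.0.x/bin/startup.sh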
Apache Solr Configuration
Once you get your Solr installation working, you need to configure the schema for Seahawk along with some other files (e.g. stopwords.txt) used by the filters in the schema itself. Download the archive containing all the importers, scripts and files needed. Take the files in the solr_conf_files directory and copy them into the conf directory inside your Solr home directory.
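For example, assuming the archive was extracted into the current directory and the Solr home is /opt/solr-home (the path is an assumption):
# copy Seahawk's schema and filter files into the Solr configuration directory
cp solr_conf_files/* /opt/solr-home/conf/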
Database Installation
In the next step we will import the XML data into a database to reconstruct the original database of a Stack Exchange web site. The tools we provide support both MySQL and PostgreSQL.
First of all, create a database (e.g. stack_overflow) and then run the scripts in the database_scripts folder to generate the needed tables. In the archive you will find two scripts, one for MySQL and one for PostgreSQL; use the one that matches your installation. As you will notice in the next step, the Stack Exchange dump provides data for all the websites in its network (e.g. gamedev.stackexchange.com, stackoverflow.com, etc.). You need to create one database for every site you want to import and use with Seahawk.
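For instance, with PostgreSQL this boils down to something like the following (the user and the script file name are illustrative; use the actual PostgreSQL script shipped in database_scripts):
# create the database
createdb -U postgres stack_overflow
# generate the tables (script file name illustrative)
psql -U postgres -d stack_overflow -f database_scripts/create_tables_pgsql.sql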
Data Pre-processing
Download the Stack Exchange Public Dump
In the second step, you need to get the Stack Exchange dump, which provides the XML files needed to process the documents.
The dump contains the data of all the websites in the Stack Exchange network, which means you will find more than just stackoverflow.com. However, in this tutorial we are only going to import the stackoverflow.com data dump; you can repeat the same steps to import, for example, the gamedev.stackexchange.com dump.
Import the XML data into PostgreSQL/MySQL
In the previous step you created the database and the related tables for stackoverflow.com. Now it's time to populate the tables with the dump's data. First of all, extract the 7zip archives for stackoverflow.com and locate three files: posts.xml, users.xml and comments.xml. Once you have located those files, run xml_importer.jar, which you can find in the importers folder, using the following options (an example invocation is shown after the list):
-a,--database_address database address (e.g. 127.0.0.1, localhost or mydbserver.org) default: localhost
-c,--comments_xml path to comments.xml
-d,--database database name
-h,--help display help information
-o,--database_port database port (default 5432)
-p,--posts_xml path to posts.xml
-s,--database_user database user
-t,--database_type database type: 'mysql' (MySQL) or 'pgsql' (PostgreSQL). Default: pgsql
-u,--users_xml path to users.xml
-v,--votes_xml path to votes.xml
-w,--database_pwd database password
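For example, a complete run against the PostgreSQL database created earlier might look like this (archive name, paths and credentials are illustrative):
# extract the 7zip archives first
7z x stackoverflow.com.7z
# import posts, users and comments; give the JVM at least 4 GB of heap (see the note below)
java -Xmx4G -jar xml_importer.jar -t pgsql -a localhost -o 5432 -d stack_overflow -s postgres -w secret -p posts.xml -u users.xml -c comments.xml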
Please note that the import phase takes about 5-6 hours to complete on a MacBook Pro (4 GB of RAM, 4 CPU cores). This is due to the huge amount of data to process (more than 2 million documents in the December '11 dump). Regarding the JVM options, the tool must be run with at least 4 GB of heap space (the -Xmx4G option), otherwise it will fail due to memory limitations.
Solr indexing and document generation
When the XML importer tool has finished, we need to reconstruct the documents from the database and index them in the Solr instance we installed previously. Therefore, run solr_importer.jar, which you can find in the importers folder, as follows:
java <jvm_options> -jar solr_importer.jar <solr_url> <stack_exchange_site: stackoverflow or gamedev> <db_name> <db_user> <db_password> <db_address> <db_port> <db_type: mysql or pgsql>
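For example, assuming Solr is reachable at http://localhost:8080/solr and using the PostgreSQL database created earlier (URL and credentials are illustrative):
java -Xmx4G -jar solr_importer.jar http://localhost:8080/solr stackoverflow stack_overflow postgres secret localhost 5432 pgsql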
Please note that this phase is going to take 7-8 hours on a powerful machine (e.g. 8 cores, 8 GB of RAM).
Plug-in configuration
Solr Preferences
Once the document importing phase is done, copy ch.usi.inf.seahawk_1.0.0.jar into Eclipse's plugins folder. When you launch Eclipse for the first time, you will be asked to restart it. Do so in order to avoid undesired behavior from Seahawk. After having restarted Eclipse, as requested by the plugin, go to Eclipse's preference panel, where you will find Seahawk's panel. As shown in the image below, you can set up some preferences for Solr. Enter the URL where the Solr service is running so that Seahawk can communicate with it. You can also set the size of the result set to retrieve, which is 50 by default.
As shown in the image below, in Seahawk's main preference panel you can set up the name of the author that will be used by Seahawk in the annotations. You can choose between a custom username or letting the plug-in use the Eclipse username, that is, the value set in the "USER" system environment variable.
Seahawk's annotation preferences
Seahawk provides an annotation system to put bookmarks directly into the code. As shown in the image above, annotation delimiters can be customized in order to create new annotations for languages not supported by default. In order not to compromise source code compilation, delimiters must be an extension of the multiline comment delimiters for a language. For example, if we consider Java, we know that the delimiters for multiline comments are /* and */. By default, Seahawk extends those delimiters by using /*! and */ to delimit its own annotations in Java. Moreover, the annotation contains an @author tag. From the preference panel above, a user can decide whether to customize the author's name or to use the one provided by the system environment variables (e.g. the one used by Eclipse).
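To give an idea, a Seahawk annotation in a Java source file would therefore be wrapped as follows (the body of the annotation is purely illustrative; only the /*! and */ delimiters and the @author tag are defined above):
/*! @author jdoe ... */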