ARX and Phoebus

This release contains two systems for extracting information from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach that exploits reference sets for this extraction. The Phoebus system is a machine learning approach that also exploits reference sets.
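To give a rough sense of what exploiting a reference set can mean, here is a toy sketch. It is not the ARX or Phoebus algorithm (see the papers and README.txt for those). The reference set of cars, the sample post, and the token-overlap scoring are all assumptions made up for illustration. The idea shown is only that matching a noisy post to a clean reference record can supply attribute values, including ones the post never states.

```python
# Toy illustration only: NOT the ARX or Phoebus algorithm.
# The reference set, the post, and the scoring function are hypothetical.

REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def best_match(post, reference_set):
    """Return the record sharing the most tokens with the post, or None."""
    post_tokens = tokens(post)
    best, best_score = None, 0
    for record in reference_set:
        record_tokens = tokens(" ".join(record.values()))
        score = len(post_tokens & record_tokens)
        if score > best_score:
            best, best_score = record, score
    return best


post = "93 civic 4dr, runs great - $1500 obo"
print(best_match(post, REFERENCE_SET))
```

In this sketch the post mentions only "civic", yet the matched record also supplies the make. A real system would need a far more robust similarity measure than exact token overlap, since posts often contain misspellings and abbreviations.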

ARX: Automatic Reference-set based eXtraction

The ARX approach is described in:
Matthew Michelson and Craig A. Knoblock, Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web, International Journal of Document Analysis and Recognition (IJDAR), Special Issue on Noisy Text Analytics, 10, p.211-226, 2007
The paper is available here (bibtex).


Phoebus

The Phoebus approach is described in:
Matthew Michelson and Craig A. Knoblock, Creating Relational Data from Unstructured and Ungrammatical Data Sources, Journal of Artificial Intelligence Research (JAIR), 31, p.543-590, 2008
The paper is available here (bibtex).

To build and run the software, download ARXPhoebus.zip and unzip it. The README.txt file then gives detailed instructions for configuring and running the software. Note that you will need Java and MySQL to run it. Please direct all inquiries to Matt Michelson (here).