The ADIOS algorithm

Version: 1.00 Date: 29/1/05

This software is available for non-commercial use only. It must not be modified or distributed without a written permission of the author. This software comes without any warranty; use it at your own risk. If you do use it in your scientific work, please cite the appropriate publication(s), and include a pointer to this website.

The binaries of ADIOS-Lite (a demo version that limits the number of extracted patterns to 100) are available after a short registration.

download ADIOS-Lite 1.0 Linux Version

download ADIOS-Lite 1.0 Cygwin Version

A Windows version will be available soon.

The algorithm section contains the following:

Installation

How to use the ADIOS software (a reference guide)

adios.exe

create_graph.exe

scrambler.exe

convert_grammar.exe

Step by Step

Preprocessing

Learning

Generating new sentences

Importing external grammar

Testing

Multi-learners

Printing

Installation

To install ADIOS, download adios-1.xx-linux.zip (where 1.xx is the version number, e.g. 1.00). Create a new directory:

mkdir adios

Move adios-1.xx-linux.zip to this directory and unpack it with

unzip adios-1.00-linux.zip

How to use the ADIOS software (a reference guide)

QUICK START

Run the shell script:

train.sh proj_name ; e.g. ./train.sh TA1

The shell script will first convert the corpus TA1.corpus.txt (which is provided with the package) into a graph. It will then run the ADIOS algorithm for a few seconds and print the results into a text file: TA1.results.txt

If for any reason you stopped the algorithm and would like to resume it, use the following script:

resume.sh proj_name ; e.g. ./resume.sh TA1

The algorithm can generate new sentences according to the rules it has acquired. Use the following script to start generating sentences:

generate.sh proj_name; e.g. ./generate.sh TA1

The newly generated sentences will be found in the file TA1.generate.txt

The TA1 corpus was generated by an external context-free grammar (CFG). Two measures are used to assess the performance of ADIOS as an unsupervised learner: Recall and Precision (for further details see below). The following shell script runs a Recall/Precision test, which measures the learning performance of ADIOS when applied to the TA1 context-free grammar.

teacher-learner.sh proj_name ; e.g. ./teacher-learner.sh TA1

If everything works properly, after approximately one minute the system will report the following:

results: TA1
Recall = Accepted: 472 out of 500 (0.944)
Precision = Accepted: 404 out of 500 (0.808)

which means that in this particular run ADIOS achieved 94% recall and 81% precision. Note that, since the algorithm generates a new corpus for each run, the numbers may differ slightly.
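The report lines above can be read back into numbers with a few lines of Python. This is only a sketch: the function name parse_report is hypothetical and not part of the ADIOS package, and it assumes the report has exactly the layout shown above.

```python
import re

# Matches report lines such as "Recall = Accepted: 472 out of 500 (0.944)".
REPORT_LINE = re.compile(r"^(Recall|Precision)\s*=\s*Accepted:\s*(\d+)\s+out of\s+(\d+)")


def parse_report(text):
    """Return {'Recall': (accepted, total, ratio), 'Precision': (...)}."""
    scores = {}
    for line in text.splitlines():
        match = REPORT_LINE.match(line.strip())
        if match:
            accepted = int(match.group(2))
            total = int(match.group(3))
            scores[match.group(1)] = (accepted, total, accepted / total)
    return scores


# Sample report copied from the text above.
report = """results: TA1
Recall = Accepted: 472 out of 500 (0.944)
Precision = Accepted: 404 out of 500 (0.808)"""

for name, (accepted, total, ratio) in parse_report(report).items():
    print(f"{name}: {accepted}/{total} = {ratio:.1%}")
```

The ratio is recomputed from the two counts rather than read from the parenthesized value, so a rounding difference in the report does not matter.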

ADIOS USAGE

The main tasks that the ADIOS program can execute are:

be trained on a given graph (see Learning Arguments)

be tested given a new unseen corpus (see Testing Arguments)

print and visualize the acquired patterns (see Printing Arguments)

generate new sentences using the acquired patterns (see Generating Arguments)

The program is invoked as follows:

adios.exe [-options] -o project_name

Example:

./adios.exe -a train -i TA1.idx -g TA1.grp -E 0.8 -S 0.01 -o TA1

Learning Arguments:

-a command     -> select between training (train), testing (test),

                  printing results (print) and generating sentences

                  (generate)

-o output_file -> file to store output results (default: out.txt)

-i index_file -> the indices file

-g graph_file -> the graph file ; when the -g option is omitted

                  the algorithm tries to resume from its last

                  stored state.

-E [0..1]      -> the ETA parameter (Default: 0.8)

-S [0..1]      -> the patterns p-value (Default: 0.01)

-C [0..1]      -> the minimal coverage required from a formed

                  Equivalence Class (Default: 0.65)

-W [3,4,5...]  -> the context window size (Default: 5)

                  when -W is set to 1000, the algorithm runs

                  in MEX mode.

-r [0,1,2,3]   -> select between the rewiring modes: 0 none;

                  1 single; 2 multiple; 3 batch

-A [1..]       -> The largest pattern size: All patterns with

                  size larger than A will be treated as equal

                  in the rewiring process.

-B [1..]       -> The minimum pattern size

output files:

proj_name.trace.log: summary of the algorithm progress

proj_name.results.txt: list of the acquired patterns and properties
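When training runs are launched from a script, it can help to assemble the command line programmatically and to check the documented value ranges first. The sketch below only builds the argument list and does not run anything. The function name build_train_command is hypothetical; the flags and defaults are the ones listed above, and the file names project.idx and project.grp follow the naming used in the example.

```python
import shlex


def build_train_command(project, eta=0.8, p_value=0.01, coverage=0.65, window=5):
    """Assemble an adios.exe training command from the documented options."""
    # -E, -S and -C are documented as values in the range [0..1].
    for name, value in (("eta", eta), ("p_value", p_value), ("coverage", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # -W is documented as 3, 4, 5, ... (1000 selects MEX mode).
    if window < 3:
        raise ValueError(f"window must be at least 3, got {window}")
    return [
        "./adios.exe", "-a", "train",
        "-i", f"{project}.idx",
        "-g", f"{project}.grp",
        "-E", str(eta),
        "-S", str(p_value),
        "-C", str(coverage),
        "-W", str(window),
        "-o", project,
    ]


# Print the command as a shell-quoted string; pass the list to
# subprocess.run() to actually execute it.
print(shlex.join(build_train_command("TA1")))
```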

Testing Arguments:

-R int -> the parser max recursions (Default R=10)

-I test_file -> file with the testing data

output files:

proj_name.test.results.txt: For each input sentence the file provides a list of patterns and paths that span the sentence or parts of it.

proj_name.test.summary.txt: For each input sentence a list of patterns (IDs) that span the sentence or parts of it.

proj_name.test.classify.txt: Classification of the novel sentences as acceptable or non-acceptable according to the acquired grammar.

Generating Arguments:

-n int -> generate n novel sentences into the output file

-R int -> the generator max recursions (Default R=10)

output files:

proj_name.generate.txt: The corpus containing the novel sentences generated by ADIOS

proj_name.generate.log: A log file which describes the generate process.

Learning Additional Arguments:

-s int         -> start the training from the s search-path

-n int         -> process only the first n sentences in the

                  corpus

-c int         -> if no patterns are found in the next c

                  iterations stop (Default: 200)

-v [0,1]       -> verbosity level: 1 - on, 0 - off (Default: 0)

ADDITIONAL TOOLS

CREATE_GRAPH USAGE

Creates the ADIOS graph to be used by adios.exe. The ADIOS graph consists of an index file, project_name.idx, and a graph file, project_name.grp

./create_graph.exe -f corpus_file -o project_name

-f corpus_file -> the file name of corpus

-o project_name -> the name of the project.

Example:

./create_graph.exe -f TA1.txt -o TA1

SCRAMBLER USAGE

Creates a new random permutation of the corpus sentences.

./scrambler.exe

-f corpus_file -> the file name of corpus

-o output_file -> the output file.

CONVERT GRAMMAR USAGE

Converts an external CFG into the ADIOS files.

./convert_grammar.exe

-l lexicon_file    -> the lexicon file name

-g grammar_file    -> the grammar file name

-o proj_name       -> the name of the project

Step By Step

PREPROCESSING:

The aim of the preprocessing stage is to convert a corpus into a format that is compatible with the ADIOS algorithm. The program loads a corpus file (e.g., corpus.txt) and converts it into two files: an index file and a graph file. In what follows, the TA1 corpus (included in the ADIOS package) serves as a running example:

* Jim and Cindy have a winning personality #

* Beth won't be released until Friday #

* a horse barked #

* the dog loved a cat #

* the cats are living very far away #

Each sentence must start on a new line, and each primitive string (e.g., a single word) must be delimited by spaces. The '*' and the '#' symbols (meaning BEGIN and END of a sentence, respectively) must be used. The following command:

./create_graph.exe -f TA1.txt -o TA1

creates two files: an index file (TA1.idx), which is a text file containing all the lexicon entries; and a graph file (TA1.grp), which is a text file describing the structure of the ADIOS graph.
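If your raw text is not yet in this format, a small script can add the BEGIN and END markers before create_graph.exe is run. The following is a minimal sketch, assuming one sentence per input string and whitespace tokenization; the function name format_corpus is hypothetical and not part of the ADIOS package.

```python
def format_corpus(sentences):
    """Wrap each sentence in the '*' (BEGIN) and '#' (END) markers."""
    lines = []
    for sentence in sentences:
        tokens = sentence.split()  # collapse runs of whitespace
        if not tokens:
            continue  # skip empty lines
        if "*" in tokens or "#" in tokens:
            raise ValueError(f"sentence already contains a marker: {sentence!r}")
        lines.append(" ".join(["*", *tokens, "#"]))
    # One sentence per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


raw = ["a horse barked", "  the dog   loved a cat "]
print(format_corpus(raw), end="")
```

Punctuation is not separated from words here; if punctuation marks should be primitives of their own, they must be split off with spaces before this step.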

LEARNING:

At the end of the preprocessing stage, ADIOS is ready to start learning. You can initiate the process as follows:

./adios.exe -a train -i TA1.idx -g TA1.grp -E 0.8 -S 0.01 -o TA1

During learning, the state of the algorithm is saved periodically into three different binary files: patterns.dat, graph.dat, and sysparams.dat. These binary files, together with the index file (e.g., TA1.idx), contain all the information necessary to restore the system's learning state, as well as test it, print the patterns, or generate sentences using the current grammar. Thus, at any time, you can halt the program and then restore it, as follows:

./adios.exe -a train -i TA1.idx

In this usage, when the graph file is not provided in the command line, the algorithm tries to use the existing (saved) binary files (graph.dat, patterns.dat and sysparams.dat), instead of starting a new session.

During learning, the algorithm updates the proj_name.trace.log file with its learning state. The file contains the following information: the number of iterations, the number of patterns that have been found, the number of equivalence classes, the number of edges in the graph, the entropy rate (bits), the size of the graph and its patterns (bits), and the compression rate.

An example of a proj_name.trace.log file:
###################################################################

ADIOS V1.00 (Automatic DIstilation Of Strctures)
Thu Mar 17 19:38:29 2005

Verbosity level: 0
Context window-length: 5
Coverage: 0.65
Eta: 0.5
P-value: 0.99
Rewiring mode: Single
itrs = number of iterations
#ptrs = number of patterns
#ecs = number of equiavlence classes
#edges = number of edges
entropy = entropy (bits)
size = size (bits)
comp = compression (%)
itrs  #ptrs  #ecs  #edges  entropy   size   comp
0     0      0     2852    1.1462    44256  100
100   27     14    957     0.545923  16640  37.5994
200   30     15    924     0.542127  16336  36.9125
300   31     16    916     0.542689  16320  36.8764
400   32     17    909     0.545649  16304  36.8402
###################################################################
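The progress table in the trace log is whitespace-separated, so it can be loaded for plotting or monitoring with a short script. This is a sketch only: parse_trace is a hypothetical helper, and it assumes the column header line looks exactly like the one in the example above.

```python
def parse_trace(text):
    """Return the progress table of a trace log as a list of dicts."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["itrs", "#ptrs"]:
            columns = fields  # header row of the progress table
        elif columns and len(fields) == len(columns):
            try:
                rows.append(dict(zip(columns, map(float, fields))))
            except ValueError:
                continue  # not a numeric row
    return rows


# Sample rows copied from the trace log shown above.
sample = """itrs = number of iterations
itrs #ptrs #ecs #edges entropy size comp
0 0 0 2852 1.1462 44256 100
100 27 14 957 0.545923 16640 37.5994
200 30 15 924 0.542127 16336 36.9125
"""

rows = parse_trace(sample)
for row in rows:
    # In this sample, comp equals the size relative to the size at iteration 0.
    print(int(row["itrs"]), row["comp"], round(100 * row["size"] / rows[0]["size"], 4))
```

In the sample values, the comp column matches 100 * size / initial size; this is an observation about the numbers shown here, not a statement about how ADIOS computes it internally.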

At the end of the learning process, the algorithm provides a file with details about the learning session, the patterns it has found, and a pattern spectra analysis.

An example of a proj_name.results.txt file:
###################################################################

ADIOS V1.00 (Automatic DIstilation Of Strctures)
Thu Mar 17 19:38:29 2005

Verbosity level: 0
Context window-length: 5
Coverage: 0.65
Eta: 0.8
P-value: 0.9
Rewiring mode: Single
Number of patterns: 36
Number of equivalence classes: 36
Initial Size: 14576 (bits)
Total size after learning: 14576 (bits)
Compression: 100%

Table of extracted patterns:
ID: unique id
seq: sequence
p-value: significance
gen: generalization factor [1..0]
len: length
occ: occurrences in corpus
ID   seq                           p-value  gen   len  occ
P44  (the,E45)                     0        1     2    157
E45  {bird,cat,cow,dog,horse,rabbit}
P46  (that,E47,thinks,that)        4.1e-07  1     4    127
E47  {Beth,Cindy,George,Jim,Joe,Pam}
P48  (E47,believes,E49)            0        0.5   4.1  153
E49  {that,P46}
P50  (that,P44,is,E51,to)          3.6e-05  0.8   6    32
E51  {eager,easy,tough}
P52  (E53,E54,that,P48)            0        0.06  7.1  49
E53  {Beth,Cindy,Jim,Joe,Pam}
E54  {believe,think,thinks}
P55  (E47,E54,P46)                 1.1e-07  0.18  6    39
P56  (to,E57,is)                   0        1     3    93
E57  {please,read}
P58  (that,E59,is,E51,to,E57)      4.7e-07  0.36  6    12

...

Pattern Spectra:
Histogram of pattern types, whose bins are labeled by sequences such as (T,P) or (E,E,T), with E standing for equivalence class, T for tree-terminal (original unit) and P for pattern.

(T,T) 0
(T,E) 0.027777778
(T,P) 0
(E,T) 0
(E,E) 0
(E,P) 0.083333333
(P,T) 0
(P,E) 0.027777778
(P,P) 0.27777778
(T,T,T) 0.055555556
(T,T,E) 0
(T,T,P) 0.027777778
(T,E,T) 0.027777778
(T,E,E) 0
(T,E,P) 0

###################################################################
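To get a feel for what the entries in the table of extracted patterns stand for, the following sketch expands a pattern into the word sequences it covers. It assumes that a pattern (P) is an ordered sequence of members and an equivalence class (E) is a set of interchangeable members, as the table layout suggests. The helpers parse_entries and expand are hypothetical and not part of the ADIOS package.

```python
import re
from itertools import product

# "P44 (the,E45) ..." is a pattern; "E45 {bird,cat,...}" is an equivalence class.
TABLE_ENTRY = re.compile(r"^([PE]\d+)\s+([({])(.*?)[)}]")


def parse_entries(text):
    """Map each ID to ('seq' | 'alt', list of members)."""
    grammar = {}
    for line in text.splitlines():
        match = TABLE_ENTRY.match(line.strip())
        if match:
            kind = "seq" if match.group(2) == "(" else "alt"
            grammar[match.group(1)] = (kind, match.group(3).split(","))
    return grammar


def expand(symbol, grammar):
    """List every word sequence covered by a symbol (no recursion limit)."""
    if symbol not in grammar:
        return [symbol]  # a terminal, i.e. an original unit of the corpus
    kind, members = grammar[symbol]
    if kind == "alt":
        return [phrase for member in members for phrase in expand(member, grammar)]
    parts = [expand(member, grammar) for member in members]
    return [" ".join(choice) for choice in product(*parts)]


# Entries copied from the sample results file above.
table = """P44 (the,E45) 0 1 2 157
E45 {bird,cat,cow,dog,horse,rabbit}
P56 (to,E57,is) 0 1 3 93
E57 {please,read}"""

grammar = parse_entries(table)
print(expand("P44", grammar))
print(expand("P56", grammar))
```

The sketch does not handle a comma that is itself a lexicon entry, and it would not terminate on a grammar with loops; the -R option of adios.exe exists for exactly that case.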

GENERATING NEW SENTENCES

The '-a generate' option generates new sentences using the acquired patterns. The -n option determines their number. The following command creates 1000 novel sentences.

./adios.exe -a generate -i TA1.idx -n 1000

The sentences can be found in the TA1.generate.txt file.

###################################################################
* that the rabbit is easy to please worries Jim #
* Jim believes that George thinks that to please is easy #
* Beth think that Cindy believes that Joe thinks that to read is easy #
* that the rabbit is easy to please bothers the cat #
* Jim thinks that Beth believes that Cindy thinks that the rabbit barks and the cat jumps , doesn't she ? #
* Beth thinks that Joe thinks that Pam thinks that to read is tough #
* Jim believe that Pam thinks that to please is tough #
* that Joe is eager to read worries Cindy #
* Joe believes that Joe thinks that to read is tough #
* that Joe is easy to please annoys the horse #
* Cindy believes that Jim thinks that Pam thinks that to please is easy #
* Jim believes that Jim thinks that George believes that Cindy thinks that to read is easy #
* that the bird is tough to please disturbs the bird #

###################################################################

The generate option creates a log file, TA1.generate.log, which shows the order of the DFS search that has produced the sentences. Each sentence starts on a new line, and contains in parentheses the list (ID, lexicon_entry; depth), which records the order of the search. In that list, ID is the unique ID number of the terminal or the pattern, lexicon_entry is the terminal string (for patterns the label is empty); depth is the current depth of the search.

###################################################################

(2 : *;0) (63 : ;0) (50 : ;1) (36 : that;2) (44 : ;2) (37 : the;3) (45 : ;3) (33 : rabbit;4) (28 : is;2) (51 : ;2) (25 : easy;3) (41 : to;2) (57 : ;1) (32 : please;2) (64 : ;1) (43 : worries;2) (8 : Jim;0)
(2 : *;0) (60 : ;0) (48 : ;1) (47 : ;2) (8 : Jim;3) (15 : believes;2) (49 : ;2) (46 : ;3) (36 : that;4) (47 : ;4) (7 : George;5) (40 : thinks;4) (36 : that;4) (56 : ;1) (41 : to;2) (57 : ;2) (32 : please;3) (28 : is;2) (25 : easy;0)
(2 : *;0) (72 : ;0) (52 : ;1) (53 : ;2) (5 : Beth;3) (54 : ;2) (39 : think;3) (36 : that;2) (48 : ;2) (47 : ;3) (6 : Cindy;4) (15 : believes;3) (49 : ;3) (46 : ;4) (36 : that;5) (47 : ;5) (9 : Joe;6) (40 : thinks;5) (36 : that;5) (56 : ;1) (41 : to;2) (57 : ;2) (34 : read;3) (28 : is;2) (25 : easy;0)

###################################################################
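The (ID : lexicon_entry;depth) entries of the generate log can be read with a regular expression, for example to recover the terminals of a sentence or to inspect the search depth. This is a sketch under the assumption that lexicon entries contain neither ';' nor ')'. The helper names are hypothetical.

```python
import re

# One log entry: "(36 : that;2)" or, for a pattern, "(63 : ;0)".
# Spacing around ':' varies, so whitespace is matched loosely.
LOG_ENTRY = re.compile(r"\(\s*(\d+)\s*:\s*([^;]*?)\s*;\s*(\d+)\s*\)")


def parse_log_line(line):
    """Return a list of (ID, lexicon_entry, depth) tuples for one sentence."""
    return [(int(i), entry, int(depth)) for i, entry, depth in LOG_ENTRY.findall(line)]


def terminals(line):
    """Join the non-empty lexicon entries in search order."""
    return " ".join(entry for _, entry, _ in parse_log_line(line) if entry)


# Shortened version of the first log line shown above.
line = "(2 : *;0) (63 : ;0) (36 : that;2) (45 :;3) (33 : rabbit;4) (8 : Jim;0)"
print(parse_log_line(line)[:2])
print(terminals(line))
```

Entries with an empty label are patterns or equivalence classes; filtering them out leaves the terminals in the order in which the search visited them.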

IMPORTING EXTERNAL GRAMMAR

Any external CFG (which may include loops) can be converted into the ADIOS representation of graph and patterns. This feature can be used to test the precision of the learning of an artificial CFG. The format of the external CFG is described here. The utility convert_grammar.exe requires two files as input: (1) a lexicon file, and (2) a grammar file. The following command converts the grammar TA1 into ADIOS graph and patterns (the utility creates the graph.dat, patterns.dat and TA1.idx files).

./convert_grammar.exe -g TA1.cfg -l TA1.lex -o TA1

If the conversion process is completed without an error, ADIOS can treat the converted grammar as if it were acquired (e.g., use it to print, generate, test, etc.)

TESTING

Two measures are used to assess the performance of ADIOS as an unsupervised learner of natural language and artificial grammars: Precision and Recall. In the artificial-grammar experiments, which start with a target grammar, a teacher instance of the model is first pre-loaded with this grammar. We illustrate this by converting the external Context Free Grammar TA1 into ADIOS representation:

./convert_grammar.exe -g TA1.cfg -l TA1.lex -o TA1

The resulting instance of ADIOS, also known as the Teacher, is now used to create the training corpus (C_training):

./adios.exe -a generate -i TA1.idx -R 10 -n 500 -o TA1

This command will generate a corpus of 500 sentences. The generation process may exceed the recursion depth threshold (set to 10). In such cases, the symbol ~ will appear as part of the sentence, indicating that it is incomplete. To remove those sentences from the corpus you can use the following command:

sed /\~/d TA1.generate.txt | head -n 400 > TA1.Ctraining.txt

The command removes the sentences with the ~ symbol, and stores the first 400 remaining sentences in the TA1.Ctraining.txt file.

We will now create an additional corpus, C_target, that contains only novel sentences that do not appear in C_training. This corpus will be used later to test the recall of the learning process. We use the same procedure as described above to create C_target.
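The two corpus-building steps can also be expressed in Python, which makes the novelty check for C_target explicit. The sketch below works on lists of sentence strings; clean mirrors the sed and head pipeline above, and novel keeps only candidates that do not occur in the training corpus. Both function names are hypothetical, and reading and writing the files is left out.

```python
def clean(sentences, limit):
    """Drop incomplete sentences (marked with '~') and keep the first `limit`."""
    complete = [s for s in sentences if "~" not in s]
    return complete[:limit]


def novel(candidates, training, limit):
    """Keep up to `limit` complete candidates that do not appear in `training`."""
    seen = set(training)
    unseen = [s for s in candidates if "~" not in s and s not in seen]
    return unseen[:limit]


# Toy data standing in for two runs of "adios.exe -a generate".
first_run = [
    "* a horse barked #",
    "* the dog ~ #",
    "* the dog loved a cat #",
]
second_run = [
    "* a horse barked #",
    "* the cat loved a dog #",
]

c_training = clean(first_run, limit=400)
c_target = novel(second_run, c_training, limit=400)
print(c_training)
print(c_target)
```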

It is a good idea to store the Teacher files including its corpora in a zip file:

zip TA1.teacher.zip graph.dat patterns.dat TA1.idx TA1.Ctraining.txt TA1.Ctarget.txt

Once the corpus is ready, we can proceed and create the ADIOS graph, and then train the ADIOS algorithm, as explained in the PREPROCESSING and LEARNING sections. After training, the learner generates a test corpus C_learner:

./adios.exe -a generate -i TA1.idx -R 10 -n 500 -o TA1

The Learner files including its corpora can also be stored in a zip file:

zip TA1.learner.zip graph.dat patterns.dat TA1.idx TA1.CLearner.txt

These two corpora, C_learner and C_target, are used to calculate precision (the proportion of C_learner sentences accepted by the Teacher) and recall (the proportion of C_target sentences accepted by the Learner). A sentence is accepted if it precisely fits one of the paths in the ADIOS graph (that is, it can be generated by the path). To measure the precision we use the following command (make sure that the dat files and the index files indeed belong to the Teacher; one way of doing that is to unzip the file TA1.teacher.zip):

./adios.exe -a test -i TA1.idx -I TA1.CLearner.txt -o TA1

This command calculates the proportion of C_learner sentences accepted by the Teacher. The results appear in three different files:

TA1.test.results.txt:
For each input sentence the file provides a list of patterns and paths that span the sentence or parts of it.

###################################################################

INPUT: 3 BEGIN [2 ; * ] [6 ; Cindy ] [40 ; thinks ] [36 ; that ] [9 ; Joe ] [40; thinks ] [36 ; that ] [41 ; to ] [32 ; please ] [28 ; is ] [42 ; tough ]

PATTERN: 48
E47 : (1, 1) 1;
E49 : (2, 2) 1; (5, 5) 1;
that : (3, 3) 1; (6, 6) 1;
AFFINITY:
SPAN = (1, 3) WEIGHT = 1 : Cindy thinks that

PATTERN: 52
E53 : (2, 2) 1; (5, 5) 1;
that : (3, 3) 1; (6, 6) 1;
AFFINITY:
SPAN = (2, 3) WEIGHT = 1 : thinks that
SPAN = (5, 6) WEIGHT = 1 : thinks that

PATTERN: 54
to : (7, 7) 1;
E55 : (8, 8) 1;
is : (9, 9) 1;
AFFINITY:
SPAN = (7, 9) WEIGHT = 1 : to please is

PATTERN: 66
E67 : (4, 4) 1;
P52 : (2, 3) 1; (5, 6) 1;
AFFINITY:
SPAN = (4, 6) WEIGHT = 1 : Joe thinks that

PATTERN: 70
P48 : (1, 3) 1;
E71 : (4, 6) 1;
AFFINITY:
SPAN = (1, 6) WEIGHT = 1 : Cindy thinks that Joe thinks that

PATTERN: 72
P70 : (1, 6) 1;
P54 : (7, 9) 1;
AFFINITY:
SPAN = (1, 9) WEIGHT = 1 : Cindy thinks that Joe thinks that to please is

Full span by path: 72 42
Accepted * Cindy thinks that Joe thinks that to please is tough #
###################################################################

TA1.test.summary.txt:
For each input sentence, a list of patterns (IDs) that span the sentence or parts of it.

###################################################################

192 [ 46 52 52 54 66 68 80 ]    Accepted
193 [ 46 46 52 52 54 61 68 80 ] Accepted
194 [ 46 48 52 52 54 64 70 72 ] Accepted
195 [ 48 48 52 52 54 79 84 ]    Accepted
196 [ 44 44 50 58 ]             Accepted
197 [ 44 44 50 56 ]             Accepted
198 [ 44 44 50 58 ]             Accepted
199 [ 46 52 52 54 66 68 80 ]    Accepted
200 [ 44 44 50 58 ]             Accepted
201 [ 46 48 52 52 54 64 70 72 ] Accepted
202 [ 46 48 52 52 54 64 70 72 ] Accepted
203 [ 46 48 52 52 54 64 70 72 ] Accepted
204 [ 48 52 52 54 66 70 72 ]    Accepted
205 [ 48 52 52 54 66 70 72 ]    Accepted
206 [ 46 48 52 52 54 79 ]       Accepted
207 [ 50 56 ]                   Rejected
208 [ 44 50 56 ]                Accepted
209 [ 44 44 50 58 ]             Accepted

Test Results:
Accepted: 453 out of 500 (0.906)
###################################################################
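The acceptance rate reported at the end of the summary file is the number of accepted sentences divided by the number of tested sentences. It is the precision when the Teacher tests C_learner, and the recall when the Learner tests C_target. The sketch below recomputes it from the per-sentence lines, assuming the line layout shown above; parse_summary and acceptance_rate are hypothetical helpers.

```python
import re

# One summary line: "207 [ 50 56 ]   Rejected"
SUMMARY_LINE = re.compile(r"^(\d+)\s*\[([\d\s]*)\]\s*(Accepted|Rejected)$")


def parse_summary(text):
    """Return (sentence_number, pattern_ids, accepted) for each sentence line."""
    records = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            pattern_ids = [int(p) for p in match.group(2).split()]
            records.append((int(match.group(1)), pattern_ids, match.group(3) == "Accepted"))
    return records


def acceptance_rate(records):
    """Proportion of accepted sentences (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for _, _, accepted in records if accepted) / len(records)


# Three lines copied from the sample summary above.
summary = """206 [ 46 48 52 52 54 79 ]       Accepted
207 [ 50 56 ]                   Rejected
208 [ 44 50 56 ]                Accepted"""

records = parse_summary(summary)
print(acceptance_rate(records))
```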

TA1.test.classify.txt:
Classification of the input sentences as acceptable or non-acceptable according to the acquired grammar: 1 - acceptable, 0 - non-acceptable.

MULTI-LEARNERS

Testing precision with multiple learners is done in the same manner as with a single one: each learner is tested separately. The recall test, however, is a bit different: each learner is tested, and the classification results are accumulated in the proj_name.test.classify.txt file. Each recall test adds its results to the file, so that at the end, line i of proj_name.test.classify.txt contains the number of times the learners accepted sentence i of the test set.
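The accumulation described above amounts to an element-wise sum of the per-learner classifications. The sketch below illustrates the idea on in-memory lists; it assumes one 0/1 value per test sentence and does not reproduce how adios.exe itself updates the file. The function name accumulate is hypothetical.

```python
def accumulate(counts, classification):
    """Add one learner's 0/1 classification to the running per-sentence counts."""
    if not counts:
        return list(classification)  # first learner: start from its results
    if len(counts) != len(classification):
        raise ValueError("all learners must be tested on the same test set")
    return [total + value for total, value in zip(counts, classification)]


# Three hypothetical learners tested on the same four sentences.
learners = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]

counts = []
for classification in learners:
    counts = accumulate(counts, classification)

print(counts)  # entry i = number of learners that accepted sentence i
```

With the counts in hand, a sentence could for instance be treated as recalled if at least one learner accepted it; which threshold to use is a choice left to the experimenter.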

PRINTING

The patterns can be visualized using Matlab, by invoking the print option. To display the patterns, do the following:

./adios.exe -a print -i TA1.idx

Then run Matlab, set the workspace to the directory where ADIOS resides, and use the following function:

pattern(123, 'TA1');

where 123 is the ID of the pattern or Equivalence Class, and TA1 is the ADIOS project name. You should see the pattern displayed in a tree form. Equivalence classes appear in blue, terminals in black, and patterns in red.

The numbers in parentheses are the generalization factors.
