Basic demos
Basic working test
To ensure AgentSlang is correctly installed, you can use the basic_test.xml test configuration.
# Inside bin directory of AgentSlang installation folder
cd ${AGENTSLANGINSTALLDIR}/bin
# In GNU/Linux
./AgentSlang -config ../config/test_configurations/basic_test.xml -profile profile1
# In Windows
AgentSlang -config ../config/test_configurations/basic_test.xml -profile profile1
If everything is working fine, you should see the following output in your terminal:
(INFORM)[org.ib.bricks.Test2] {id=1, language=none, data='Hello-t2:1'}
(INFORM)[org.ib.bricks.Test2] {id=0, language=none, data='Hello-t1:0'}
(INFORM)[org.ib.bricks.Test2] {id=1, language=none, data='Hello-t1:1'}
(INFORM)[org.ib.bricks.Test2] {id=2, language=none, data='Hello-t2:2'}
(INFORM)[org.ib.bricks.Test2] {id=2, language=none, data='Hello-t1:2'}
CereProc (Text To Speech) Component
For text-to-speech needs, you can use CereProc. It is proprietary, paid software, so you will need to own a license and at least one voice file. Place these files in the dedicated folder named cereproc_files inside config. Then, modify the config file cereproc_demo.xml to indicate the correct path and name of these required files; the paths are set by the voice and licenseFile parameters.
Afterwards, you can launch the script:
# Inside bin directory of AgentSlang installation folder
cd ${AGENTSLANGINSTALLDIR}/bin
# In GNU/Linux
./AgentSlang -config ../config/test_configurations/cereproc_demo.xml -profile profile1
# In Windows
AgentSlang -config ../config/test_configurations/cereproc_demo.xml -profile profile1
A text terminal will appear, in which you can type the text you want to synthesise.
Advanced demos
Addressee Detector and VFOA Prediction
Currently, the code can predict the addressee and the visual focus of attention (VFOA) during an interaction that involves up to four participants and up to three objects. An additional category, other, is used when none of the participants or objects is in focus. The participant names are PM, UI, ME and ID, whereas the objects are table, whiteboard and slide screen.
These two components require the following Python dependencies:
python3 -m pip install xgboost==0.90
python3 -m pip install scikit-learn==0.22.2
1. Addressee detector
The addressee predictor is a standalone component which predicts the addressee of the current utterance. It is implemented in the addressee_predictor.py file, which contains a single class named AddrPredictor with the following functions.
1.1. load_models()
This function loads the pickle object containing the machine learning model for addressee detection, stores the loaded model in the self.addressee_predictor variable, and returns nothing. The parameters of the load_models() function are as follows.
Parameter Name (type) | Description | Possible Values |
---|---|---|
tc_predictor (string) | string path to the pickle object | Any string value containing a file path. |
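A minimal usage sketch in Python follows. The import path and the pickle file name are placeholders (not taken from the repository), and it is assumed that load_models() takes the pickle path as its only argument, as the table above suggests.
# Hypothetical sketch: adjust the import and the pickle path to your installation.
from addressee_predictor import AddrPredictor

predictor = AddrPredictor()
# tc_predictor: path to the pickled addressee-detection model (placeholder name)
predictor.load_models("path/to/addressee_model.pkl")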
1.2. data_fv_converter()
Converts the individual input parameters into a complete feature vector that can be passed to the machine learning model for addressee prediction. The function returns a one-dimensional numpy array representing all the features. The parameters of the data_fv_converter() function are as follows.
Note: These arguments should be passed in the order in which they appear in the table.
Parameter Name (type) | Description | Possible Values |
---|---|---|
you_usage (binary) | whether or not the sentence contains the word “you” | 1 for true, 0 for false |
duration_ms (float) | duration of the utterance in seconds | any floating-point value |
sentence_length (integer) | number of words in the utterance | any integer value |
focus_speaker (list) | A list of 8 floating-point items where each item shows the ratio of the speaker's focus towards participants and objects. The index order is [0 = UI, 1 = ME, 2 = PM, 3 = ID, 4 = table, 5 = slide screen, 6 = whiteboard, 7 = other]. | Any floating-point list of 8 items. For example [0.14227, 0, 0, 0, 0.186047, 0.61683, 0, 0]. |
focus_listener_pm (list) | A list of 7 floating-point items where each item shows the ratio of PM's focus while listening towards participants and objects. The index order is [0 = UI, 1 = ME, 2 = ID, 3 = table, 4 = slide screen, 5 = whiteboard, 6 = other]. | Any floating-point list of 7 items. For example [0, 0, 0, 0, 1, 0, 0], which means that PM looks at the slide screen for the whole utterance since index 4 is 1. |
focus_listener_ui (list) | A list of 7 floating-point items where each item shows the ratio of UI's focus while listening towards participants and objects. The index order is [0 = ME, 1 = PM, 2 = ID, 3 = table, 4 = slide screen, 5 = whiteboard, 6 = other]. | Any floating-point list of 7 items. For example [0, 1, 0, 0, 0, 0, 0], which means that UI looks at PM for the whole utterance since index 1 is 1. |
focus_listener_id (list) | A list of 7 floating-point items where each item shows the ratio of ID's focus while listening towards participants and objects. The index order is [0 = UI, 1 = ME, 2 = PM, 3 = table, 4 = slide screen, 5 = whiteboard, 6 = other]. | Any floating-point list of 7 items. For example [0.5, 0.5, 0, 0, 0, 0, 0], which means that ID looks at UI and ME for 50% each of the duration of the utterance, since indexes 0 and 1 are each 0.5. |
focus_listener_me (list) | A list of 7 floating-point items where each item shows the ratio of ME's focus while listening towards participants and objects. The index order is [0 = UI, 1 = PM, 2 = ID, 3 = table, 4 = slide screen, 5 = whiteboard, 6 = other]. | Any floating-point list of 7 items. For example [0, 0, 1, 0, 0, 0, 0], which means that ME looks at ID for the whole utterance since index 2 is 1. |
speaker_role (string) | Represents role of the speaker of the current utterance | “pm”, “ui”, “me”, or “id” |
prev_speaker_role (string) | Represents role of the speaker of the previous utterance | “pm”, “ui”, “me”, or “id” |
prev_addr_role (string) | Represents role of the addressee of the previous utterance | “group”, “pm”, “ui”, “me”, or “id” |
da (string) | dialogue act of the current utterance | Dialogue acts from the AMI dataset: ‘ass’, ‘be.neg’, ‘be.pos’, ‘el.ass’, ‘el.inf’, ‘el.sug’, ‘el.und’, ‘inf’, ‘off’, ‘sug’, ‘und’ |
prev_da (string) | dialogue act of the previous utterance | Dialogue acts from the AMI dataset: ‘ass’, ‘be.neg’, ‘be.pos’, ‘el.ass’, ‘el.inf’, ‘el.sug’, ‘el.und’, ‘inf’, ‘off’, ‘sug’, ‘und’ |
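As an illustrative sketch only, assuming data_fv_converter() takes these values as positional arguments in the table order, the call could look like this (the values are taken from the JSON example in section 1.4):
# Illustrative sketch: positional arguments in the table order.
fv = predictor.data_fv_converter(
    0,                                            # you_usage
    7.31,                                         # duration_ms
    20,                                           # sentence_length
    [0.14227, 0, 0, 0, 0.186047, 0.61683, 0, 0],  # focus_speaker
    [0, 0, 0, 0, 0, 0, 0],                        # focus_listener_pm
    [0, 0, 0, 0, 1, 0, 0],                        # focus_listener_ui
    [0, 0, 1, 0, 0, 0, 0],                        # focus_listener_id
    [0, 0, 0, 0.32695, 0.0383035, 0, 0],          # focus_listener_me
    "pm",                                         # speaker_role
    "pm",                                         # prev_speaker_role
    "id",                                         # prev_addr_role
    "sug",                                        # da
    "ass")                                        # prev_da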
1.3. predict_addr()
The predict_addr() function predicts the addressee of the current utterance. The parameters of the predict_addr() function are as follows.
Parameter Name (type) | Description | Possible Values |
---|---|---|
input_sequence (numpy array) | The numpy array containing the feature vector returned by the data_fv_converter() method | numpy array containing numeric values for the features |
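Assuming the behaviour described above, a short sketch of the final prediction step could be:
# Sketch: feed the feature vector from data_fv_converter() to predict_addr()
addressee = predictor.predict_addr(fv)
print(addressee)   # e.g. 'group'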
1.4. AgentSlang Component
An AgentSlang component called AddresseePredictorComponent.py can be used; it can be launched using the following script:
# Go into AgentSlang installation directory
cd ${AGENTSLANGINSTALLDIR}
cd config/test_configurations
# Run Python module
./launch_python_addressee_predictor_component.sh
# Run AgentSlang program
cd ../../bin
./AgentSlang -config ../config/test_configurations/config_addressee_predictor.xml -profile profile1
An AgentSlang text prompt will appear on your screen; your input must respect the conventions presented in section 1.2. It must be a JSON string such as the one presented below:
{"you_usage": 0, "duration_ms": 7.31, "sentence_length": 20, "focus_speaker": [0.14227, 0, 0, 0, 0.186047, 0.61683, 0, 0], "focus_listener_pm": [0, 0, 0, 0, 0, 0, 0], "focus_listener_ui": [0, 0, 0, 0, 1, 0, 0], "focus_listener_id": [0, 0, 1, 0, 0, 0, 0], "focus_listener_me": [0, 0, 0, 0.32695, 0.0383035, 0, 0], "speaker_role": "pm", "prev_speaker_role": "pm", "prev_addr_role": "id", "da": "sug", "prev_da": "ass"}
The text prompt should give you this answer:
{id=1, language=en-US, data='group'}
You can try with other values according to your interaction data.
2. Speaker VFOA Behavior Generator
The Speaker VFOA Behavior Generator is a standalone component which predicts the complete VFOA of the speaker during an utterance, including the number of VFOA turns per utterance, the duration per VFOA turn, the target per VFOA turn, and the scheduling of the VFOA turns.
The VFOA behaviour generator is implemented in the speaker_VFOA_predictor.py file, which contains a single class named SpkVFOAPredictor with the following functions.
2.1. load_models()
This function loads the pickle objects containing the machine learning models for predicting the number of VFOA turns, the duration per VFOA turn, and the target per VFOA turn. The loaded models are stored respectively in the self.num_vfoa_turns_pred, self.vfoa_dur_pred, and self.vfoa_dir_pred variables, and the function returns nothing. The parameters of the load_models() function are as follows.
Note: These arguments should be passed in the order in which they appear in the table.
Parameter Name (type) | Description | Possible Values |
---|---|---|
num_vfoa_turns_pred (string) | string path to the pickle object containing the machine learning model for predicting the number of VFOA turns | Any string value containing a file path. |
vfoa_dur_pred (string) | string path to the pickle object containing the machine learning model for predicting the duration per VFOA turn | Any string value containing a file path. |
vfoa_dir_pred (string) | string path to the pickle object containing the machine learning model for predicting the target per VFOA turn | Any string value containing a file path. |
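As with the addressee predictor, a minimal loading sketch follows; the pickle file names are placeholders and the arguments are assumed to be positional, in the table order.
# Hypothetical sketch: replace the pickle paths with the ones in your installation.
from speaker_VFOA_predictor import SpkVFOAPredictor

vfoa_predictor = SpkVFOAPredictor()
vfoa_predictor.load_models(
    "path/to/num_vfoa_turns_model.pkl",   # num_vfoa_turns_pred
    "path/to/vfoa_duration_model.pkl",    # vfoa_dur_pred
    "path/to/vfoa_target_model.pkl")      # vfoa_dir_pred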
2.2. data_fv_converter()
Converts the individual input parameters into a complete feature vector that can be passed to the machine learning models for VFOA prediction. The function returns a one-dimensional numpy array representing all the features. The parameters of the data_fv_converter() function are as follows:
Note: These arguments should be passed in the order in which they appear in the table.
Parameter Name (type) | Description | Possible Values |
---|---|---|
start_time (float) | start time of the utterance in milliseconds | any floating-point value |
end_time (float) | end time of the utterance in milliseconds | any floating-point value |
duration_ms (float) | duration of the utterance in seconds | any floating-point value |
speaker_role (string) | Represents role of the speaker of the current utterance | “pm”, “ui”, “me”, or “id” |
addressee_role (string) | Represents role of the addressee of the current utterance. This value is known to the agent, who is also the speaker of the current utterance. | “group”, “pm”, “ui”, “me”, or “id” |
prev_addressee (string) | Represents role of the addressee of the previous utterance. This value can be predicted by the addressee predictor if the speaker of the previous utterance is not an agent. | “group”, “pm”, “ui”, “me”, or “id” |
prev_speaker (string) | Represents role of the speaker of the previous utterance | “pm”, “ui”, “me”, or “id” |
da (string) | dialogue act of the current utterance | Dialogue acts from the AMI dataset: ‘ass’, ‘be.neg’, ‘be.pos’, ‘el.ass’, ‘el.inf’, ‘el.sug’, ‘el.und’, ‘inf’, ‘off’, ‘sug’, ‘und’ |
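As a hedged sketch, assuming positional arguments in the table order, the call could reuse the values from the JSON example in section 2.5:
# Illustrative sketch: positional arguments in the table order.
fv = vfoa_predictor.data_fv_converter(
    1000120,   # start_time
    1008790,   # end_time
    3053,      # duration_ms
    "id",      # speaker_role
    "group",   # addressee_role
    "group",   # prev_addressee
    "id",      # prev_speaker
    "sug")     # da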
2.3. schedule_vfoa()
The schedule_vfoa() function is responsible for scheduling the VFOA. Its output is a string containing the durations and targets of the VFOA turns in sequence.
This function is called from the generate_vfoa() function (explained in the next section); hence, all of its parameters are passed through the generate_vfoa() method. The parameters of the schedule_vfoa() function are as follows.
Note: You do not need to pass these parameters manually to schedule_vfoa(), as they are passed via the generate_vfoa() function.
Parameter Name (type) | Description | Possible Values |
---|---|---|
total_duration (float) | duration of the utterance in milliseconds | any floating-point value |
total_predicted (int) | number of predicted targets | 1-8 (depending upon the number of participants and objects in sight) |
predicted_index (list) | List containing the indexes of targets predicted for the current utterance. The list cannot contain more than 8 items since the total number of participants and objects is 8. | List item values should not repeat and should be between 0 and 7. |
c_spk (string) | Represents role of the speaker of the current utterance | “pm”, “ui”, “me”, or “id” |
c_adr (string) | Represents role of the addressee of the current utterance. This value is known to the agent, who is also the speaker of the current utterance. | “group”, “pm”, “ui”, “me”, or “id” |
p_spk (string) | Represents role of the speaker of the previous utterance | “pm”, “ui”, “me”, or “id” |
p_adr (string) | Represents role of the addressee of the previous utterance. This value can be predicted by the addressee predictor if the speaker of the previous utterance is not an agent. | “group”, “pm”, “ui”, “me”, or “id” |
2.4. generate_vfoa()
The generate_vfoa() function predicts the final gaze, including the turns, targets, durations and scheduling of the VFOA. The parameters of the generate_vfoa() function are as follows:
Note: These arguments should be passed in the order in which they appear in the table.
Parameter Name (type) | Description | Possible Values |
---|---|---|
total_duration (float) | duration of the utterance in milliseconds | any floating-point value |
input_sequence (numpy array) | The numpy array containing the feature vector returned by the data_fv_converter() method | numpy array containing numeric values for the features |
predicted_index (list) | List containing the indexes of targets predicted for the current utterance. The list cannot contain more than 8 items since the total number of participants and objects is 8. | List item values should not repeat and should be between 0 and 7. |
spk_role (string) | Represents role of the speaker of the current utterance | “pm”, “ui”, “me”, or “id” |
add_role (string) | Represents role of the addressee of the current utterance. This value is known to the agent, who is also the speaker of the current utterance. | “group”, “pm”, “ui”, “me”, or “id” |
pspk_role (string) | Represents role of the speaker of the previous utterance | “pm”, “ui”, “me”, or “id” |
padd_role (string) | Represents role of the addressee of the previous utterance. This value can be predicted by the addressee predictor if the speaker of the previous utterance is not an agent. | “group”, “pm”, “ui”, “me”, or “id” |
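A hedged sketch of the call follows, with arguments in the table order; the predicted_index list shown here is an arbitrary illustration, not a value taken from the repository.
# Illustrative sketch: positional arguments in the table order.
schedule = vfoa_predictor.generate_vfoa(
    3053,       # total_duration
    fv,         # input_sequence from data_fv_converter()
    [0, 1],     # predicted_index: example target indexes between 0 and 7
    "id",       # spk_role
    "group",    # add_role
    "id",       # pspk_role
    "group")    # padd_role
print(schedule)   # e.g. 'ME:1220.0 , UI:1644.9 , Others:187.9 , '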
2.5. AgentSlang Component
An AgentSlang component called SpeakerVFOAPredictorComponent.py can be used; it can be launched using the following script:
# Go into AgentSlang installation directory
cd ${AGENTSLANGINSTALLDIR}
cd config/test_configurations
# Run Python module
./launch_python_speaker_vfoa_predictor_component.sh
# Run AgentSlang program
cd ../../bin
./AgentSlang -config ../config/test_configurations/config_vfoa_predictor.xml -profile profile1
An AgentSlang text prompt will appear on your screen; your input must respect the conventions presented in section 2.2. It must be a JSON string such as the one presented below:
{"start_time": 1000120, "end_time": 1008790, "duration_ms": 3053, "speaker_role": "id", "addressee_role": "group", "prev_addressee": "group", "prev_speaker": "id", "da": "sug"}
The text prompt should give you this answer:
{id=1, language=en-US, data='ME:1220.0462758372528 , UI:1644.9864549903666 , Others:187.96726917238047 , '}
You can try with other values according to your interaction data.