The motion is out of control now. It took a while to experiment with attitude control learning. Though I got ideas from actor-critic learning, I used simpler algorithms, as the task is simple. Here's the idea (and the result).
So the current issue is attitude control, perhaps by reinforcement learning.
- Body rotation & the stabilization goal
The body to be stabilized rotates, and the goal of stabilization is to align its bottom center with the vertical direction. The body's action is an acceleration in either direction.
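To make the setup concrete, here is a minimal Java sketch of the body dynamics as I read them; the time step, the integration scheme, and the angle convention (angle 0 = bottom center aligned with vertical) are my own assumptions, not taken from the original program.

```java
// Minimal sketch of the rotating body (assumed dynamics, not the original code).
// The state is an angle and an angular velocity; the action is an angular
// acceleration applied in either direction.
public class RotatingBody {
    double angle;    // radians; 0 means the bottom center points straight down (assumed convention)
    double velocity; // angular velocity per step
    static final double DT = 0.05; // integration step (assumed)

    /** Apply one action (angular acceleration) and advance the simulation. */
    void step(double acceleration) {
        velocity += acceleration * DT;
        angle += velocity * DT;
        // keep the angle wrapped into (-pi, pi] for convenience
        angle = Math.atan2(Math.sin(angle), Math.cos(angle));
    }
}
```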
- States
The learner learns a policy mapping states to actions.
The state is represented as an analog vector whose dimensions correspond to rotation angles of the body. The angle of the body is mapped to the vector by Gaussian filters, one per dimension, each centered on the angle allocated to that dimension. For example, with 36 dimensions, each dimension is allocated a 10-degree notch. The Gaussian closest to the current rotation angle outputs the largest value.
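A sketch of this encoding in Java follows; the filter width SIGMA is an assumption, since the original value is not given.

```java
// Sketch of the state encoding: the body angle is turned into an analog vector
// by a bank of Gaussian filters, one per dimension, each centered on its own
// preferred angle (every 10 degrees for 36 dimensions).
public class StateEncoder {
    static final int DIMS = 36;                        // 36 dimensions -> 10-degree notches
    static final double SIGMA = Math.toRadians(10.0);  // filter width (assumed)

    /** Map an angle (radians) to the state vector; the filter nearest the angle fires most. */
    static double[] encode(double angle) {
        double[] v = new double[DIMS];
        for (int i = 0; i < DIMS; i++) {
            double center = Math.toRadians(i * 360.0 / DIMS) - Math.PI;            // preferred angle
            double d = Math.atan2(Math.sin(angle - center), Math.cos(angle - center)); // wrapped distance
            v[i] = Math.exp(-(d * d) / (2 * SIGMA * SIGMA));
        }
        return v;
    }
}
```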
- Linear architecture
The learner learns the mean acceleration for the current angle. The mean acceleration is modeled as a linear sum over the state vector, so the learner learns the summation weights.
average(output) = Σᵢ wᵢ × vᵢ
The actual output (acceleration) is drawn from a Gaussian distribution whose mean is the mean acceleration given by the linear architecture and whose variance is also learned.
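A sketch of this policy in Java follows; the initial weights and the initial value of σ are assumptions.

```java
import java.util.Random;

// Sketch of the linear Gaussian policy: the mean acceleration is the weighted
// sum of the state vector, and the actual acceleration is drawn from a Gaussian
// with that mean and a learned standard deviation sigma.
public class LinearGaussianPolicy {
    final double[] weights;
    double sigma = 1.0;              // learned exploration width (initial value assumed)
    final Random random = new Random();

    LinearGaussianPolicy(int dims) {
        weights = new double[dims];  // start from zero weights (assumed)
    }

    /** average(output) = sum_i w_i * v_i */
    double mean(double[] state) {
        double sum = 0.0;
        for (int i = 0; i < state.length; i++) sum += weights[i] * state[i];
        return sum;
    }

    /** Sample the actual acceleration from N(mean, sigma^2). */
    double act(double[] state) {
        return mean(state) + sigma * random.nextGaussian();
    }
}
```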
- Weights are learned with one of the following algorithms (both updates are sketched in code after the list):
- Average success
The weight is set to the average output for the state.
- Estimation difference
Move the weight toward the actual output in proportion to (actual output - estimated output), where the estimated output is the mean acceleration given by the linear architecture.
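The following Java sketch shows one way to read the two update rules. Both methods assume the update is applied after a rewarded action; the activation-weighted running average (and its counts[] helper) is my interpretation of "the average output for the state", and the learning rate eta is an assumed parameter.

```java
// Two possible readings of the weight-update rules (assumptions, not the original code).
public class WeightUpdates {

    // Average success: each weight tracks a running average of the output
    // observed while its state dimension was active; counts[] is an assumed
    // helper holding the accumulated activation per dimension.
    static void averageSuccess(double[] weights, double[] counts,
                               double[] state, double output) {
        for (int i = 0; i < weights.length; i++) {
            counts[i] += state[i];  // activation-weighted count
            if (counts[i] > 0) {
                weights[i] += state[i] * (output - weights[i]) / counts[i];
            }
        }
    }

    // Estimation difference: move each weight toward the actual output in
    // proportion to (actual output - estimated output); eta is the learning rate.
    static void estimationDifference(double[] weights, double[] state,
                                     double output, double estimate, double eta) {
        for (int i = 0; i < weights.length; i++) {
            weights[i] += eta * (output - estimate) * state[i];
        }
    }
}
```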
- Reward
Action success is measured by the following rewards (a sketch follows the list):
- When the bottom sways away from the vertical position, the reward is the reduction in speed.
- When the bottom is approaching the vertical position, the reward is the amount of approach.
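A minimal sketch of this reward, assuming angle 0 means the bottom center is vertical; the exact quantities are my reading of the description above, not the original code.

```java
// Reward sketch: reward the approach when the bottom is moving toward the
// vertical, and reward the reduction in speed when it is swaying away.
public class Reward {
    static double compute(double prevAngle, double angle,
                          double prevSpeed, double speed) {
        boolean approaching = Math.abs(angle) < Math.abs(prevAngle);
        if (approaching) {
            return Math.abs(prevAngle) - Math.abs(angle);  // amount of approach
        } else {
            return Math.abs(prevSpeed) - Math.abs(speed);  // reduction in speed
        }
    }
}
```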
- Variance adjusting
When the action is successful, adjust the variance σ in the following way, so that σ tracks the discrepancy between the actual output and the estimate:
σ = σ + α × reward × (|output - estimate| - σ)
where α is the learning rate.
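In Java, this update is a one-liner; the only assumption in the sketch below is that it is called only after a successful (rewarded) action, as stated above.

```java
// Variance adjustment on a successful action:
//   sigma = sigma + alpha * reward * (|output - estimate| - sigma)
public class VarianceUpdate {
    static double adjust(double sigma, double alpha, double reward,
                         double output, double estimate) {
        return sigma + alpha * reward * (Math.abs(output - estimate) - sigma);
    }
}
```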
- Result
Either algorithm worked (somehow, the body stabilizes very quickly). Learning seems to converge faster with average success. I opt for average success, as it is a cumulative method (the other method keeps no cumulative memory of the past). The estimation difference method would also require a mechanism for reducing its learning rate so the learning settles, which I did not implement this time.
- Current plan
This trial was implemented not as a Sigverse simulation but as a stand-alone Java program. A Sigverse implementation, in which rotation will be three-dimensional, is planned.