[LinuxCNC/linuxcnc Issue #240] limit3 disregarding min, max and starting slow runaway sometimes

Issue #240 | Status: Closed | Author: propcoder | Created: 2017-03-04

Labels: bug, affects 2.7, affects master, pull-request-welcome

out goes outside the [min, max] range, and sometimes a slow runaway starts.

I noticed the slow and constant runaway two years ago, just could not catch it. Now I caught it and can reproduce.

Here are the steps I follow to reproduce the issue:

On the first terminal:

    halrun
    loadrt limit3 names=l
    setp l.min -50
    setp l.max 50
    setp l.maxv 4000
    setp l.maxa 10000
    loadrt siggen names=sg
    setp sg.amplitude 100
    net s sg.sine => l.in
    loadrt threads name1=t period1=1000000
    addf sg.update t
    addf l t
    start

On the second terminal:

    watch -n 0.2 halcmd show pin l

Watch for the out value to go outside the [min, max] limits.

Then, on the first terminal, run the following lines manually at random times and wait for out to start the runaway instead of stopping. Be patient and try many times; the second time I needed about 20 tries:

    unlinkp l.in
    net s l.in

This is what I expected to happen:

* out stays inside the range [-50, 50]
* out holds steady when all inputs are steady and after a stop

This is what happened instead:

* out goes out of the range
* sometimes out runs away slowly when all inputs are steady and after the expected stop

Information about my hardware and software:

* I am using this Linux distribution and version: Debian Jessie 8.7, amd64
* I am using this kernel version: 4.5.0-0.bpo.2-amd64
* I am running LinuxCNC 2.8.0-pre1-2881-ge624275, noticed on 2015 realtime x86 versions too
* [+] A binary version from linuxcnc.org (including buildbot.linuxcnc.org)

#1 – SebKuzminsky on 2017-07-17

Here’s a halscope screenshot showing the limit3 constraint violation on 2.7.
[Image: limit3-overshoot]

#2 – SebKuzminsky on 2017-07-17

And here’s a runaway. Note that I happened to disconnect its input while its output was in an overshoot condition.
[Image: limit3-overshoot-runaway]

#3 – SebKuzminsky on 2017-07-18

I started adding tests for this issue in a branch named 2.7-limit3-issue-240

#4 – SebKuzminsky on 2017-10-17

These tests are in a branch named “2.7-limit3-issue-240”.

#5 – jmkasunich on 2017-10-17

Don’t know exactly what test you are running (and can’t go diving into that branch at the moment.)
Can you describe the test in plain language? What is the input signal? Steps? Triangles? Sines?

I recall something from the development stage of limit3 where certain odd inputs could screw it up. I think it might have been a triangle wave with a velocity greater than the limit3’s velocity limit, such that the limit block can NEVER track the input velocity. I don’t recall if I was able to fix it.


#6 – SebKuzminsky on 2017-10-17

“Runaway” test:
* limit3.min = -50
* limit3.max = 50
* limit3.maxv = 4000
* limit3.maxa = 10000
* limit3.in is a signal sampled from a siggen sine with an amplitude of 100. The signal includes the first half-wave and part of the second half-wave; it ends at -42 and just sits there. The limit3 output keeps going down past the input and past its min. This looks like the “runaway” screenshot above, but without connecting and disconnecting the limit3 in pin.

“Overshoot” test aka constraint violation:
* same limit3 min/max/maxv/maxa constraints as above
* input is a siggen sine with amplitude 100
* limit3 output goes slightly above max and below min, every cycle, then corrects and returns to the limit, until the next cycle (as shown in the constraint violation screenshot above)
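
Both tests come down to checking a recorded output trace against the four constraints. A minimal sketch of such a check follows, assuming the trace is an array of evenly spaced samples and that velocity and acceleration are estimated by finite differences. The names `violations` and `check_trace` and the tolerance `eps` are hypothetical and are not taken from the test branch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical result type: how many samples broke each constraint. */
typedef struct {
    size_t pos;   /* samples outside [min, max] */
    size_t vel;   /* steps faster than maxv */
    size_t acc;   /* velocity changes steeper than maxa */
} violations;

/* Scan an evenly sampled output trace (sample period dt) and count
 * position, velocity and acceleration violations. */
static violations check_trace(const double *out, size_t n, double dt,
                              double min, double max,
                              double maxv, double maxa)
{
    assert(out != NULL && dt > 0.0);
    violations r = {0, 0, 0};
    const double eps = 1e-9;   /* assumed tolerance for rounding noise */
    double prev_v = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (out[i] < min - eps || out[i] > max + eps)
            r.pos++;
        if (i == 0)
            continue;          /* no velocity estimate for the first sample */
        double v = (out[i] - out[i - 1]) / dt;
        if (fabs(v) > maxv + eps)
            r.vel++;
        /* acceleration needs two velocity estimates, so start at i == 2 */
        if (i >= 2 && fabs(v - prev_v) / dt > maxa + eps)
            r.acc++;
        prev_v = v;
    }
    return r;
}
```

With the constraints from this report (min -50, max 50, maxv 4000, maxa 10000), a passing trace would return all three counts as zero; the overshoot test would show up as a nonzero `pos` count.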

#7 – samcoinc on 2017-10-17

This is what I have done for testing.
http://electronicsam.com/images/KandT/testing/limit3/

I added a bunch of HAL pins to expose internal variables (limit3.comp), then ran the limit.hal file. I also ran the above

    watch -n 0.2 halcmd show pin l

in a separate command window.

I would disconnect and reconnect sg.sine from the signal s while watching l.out. After 5 to 10 disconnects and reconnects, l.out will keep moving in the negative direction and not stop while disconnected. It seems to always run away in the negative direction.

The spreadsheet in the directory has 2 sampler logs. The first set of data shows the run-away. The second set shows a normal (I think) disconnect.

I thought I could figure out what was going on, but I am a bit over my head.

sam

#8 – jmkasunich on 2017-10-17

Yuck. That is going to be a bear to troubleshoot. Limit3 has a bunch of non-trivial math (I remember filling a couple pages with algebra), and then some empirical tweaks to deal with the discrete time nature of the code vs continuous time of normal physics, where V = integral(A) and P = integral(V). It’s probably as close to “here be dragons” code as anything I’ve ever written.

One suggestion: look at the git history and see if/when the core math was last touched. Apply the same test to the prior versions. I thought I tested the original code pretty hard, but maybe something slipped thru. Or maybe some later change broke it in a way that didn’t show up in the testing at that time. Testing old versions will tell which situation applies.

John K
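
The gap between discrete and continuous time mentioned above can be seen in a braking calculation. The sketch below compares the continuous stopping distance v²/(2a) with the distance covered when velocity and position are updated once per period. The function names are hypothetical, and the update order (velocity first, then position from the new velocity) is an assumption made for illustration, not a statement about how limit3 integrates.

```c
#include <assert.h>
#include <math.h>

/* Distance covered while braking from speed v at rate a, in steps of dt.
 * Assumed update order: velocity first, then position from the new velocity. */
static double discrete_stop_distance(double v, double a, double dt)
{
    assert(v >= 0.0 && a > 0.0 && dt > 0.0);
    double d = 0.0;
    while (v > 0.0) {
        v -= a * dt;
        if (v < 0.0)
            v = 0.0;       /* do not reverse on the last step */
        d += v * dt;
    }
    return d;
}

/* Textbook continuous-time stopping distance for the same braking rate. */
static double continuous_stop_distance(double v, double a)
{
    return v * v / (2.0 * a);
}
```

For v = 100, a = 10000 and dt = 0.001 the discrete sum is 0.45 while the continuous formula gives 0.5. The difference is v·dt/2, which is the kind of term an empirical tweak has to absorb.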


#9 – SebKuzminsky on 2017-10-17

There haven’t really been any changes to the core math of limit3 since it was moved from blocks to LinuxCNC. We converted the constraints from params to pins, added a “load” pin, and that’s about it.

I did test every version between the beginning and the current tip of 2.7, and all of them fail both tests described in this bug report.

#10 – jmkasunich on 2017-10-17

Well damn… that means it’s my fault.

Not going to be able to look at it in the foreseeable future unfortunately.


#11 – jmkasunich on 2017-10-17

I had been viewing this thread via email and had no idea that there were halscope screenshots at github. Just took a look at them now. (Also didn’t see the original posting, not sure exactly when/why I started being copied on the email, but glad it happened.)

I do recall seeing behavior similar to this during my early testing. I can only scratch the surface of the problem; it is too deep for me to wrap my head around right now. But it has to do with the sine-wave input. That combination of amplitude and frequency means that accel, possibly velocity, and definitely position are beyond the limits. The limit3 block is trying to track the input. When the input changes too quickly for the output to follow (velocity or accel limitation), it tries hard to predict what the input will be doing and smoothly blend into that when it finally catches up (i.e., when the position error becomes zero, it tries to also have the velocity error be zero).

It works well for steps, because the input is stationary, which is easy to blend to. It works for a triangle, as long as the velocity is within the limit. If the velocity is outside the limit it can NEVER track the input, since a triangle input is always moving at either a positive or negative velocity that is too high. For a sine wave, it can’t really predict what the input will be doing by the time the position error is corrected.

In the case of the first test (overshoot), it looks like the output JUST manages to catch up to the input (matching position and velocity, while obeying the accel limit) at the instant that the input (and output) hits the position limit. It has no way of knowing that the input is about to hit the position limit and stop instantly. The output must obey the accel limit, so it must overshoot. I’m curious what would happen if the sine wave was slower, such that the output managed to track the input well before the input hits the position limit.

I’m interested in being involved in debugging this, but I can’t devote much time to it for several days, and don’t really have a LinuxCNC development system handy. Please keep the thread alive and I’ll contribute whatever I can.

John K
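
The "must overshoot" argument can be put in numbers with the constant-deceleration stopping distance. This is a back-of-envelope sketch, assuming the output is moving at speed $v$ when its target stops dead and that it then brakes at exactly `maxa`. The speed in the example is hypothetical and is not read from the halscope traces.

$$
d_{\text{stop}} = \frac{v^{2}}{2\,a_{\max}},
\qquad
v = 500,\ a_{\max} = 10000
\ \Rightarrow\
d_{\text{stop}} = \frac{500^{2}}{2 \cdot 10000} = 12.5
$$

An output that is still moving when its target pins at the limit cannot stop in less than this distance, so it ends up past the limit unless it began braking earlier.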


#12 – SebKuzminsky on 2017-10-18

I wonder if we should be aiming somehow to share code between the limit3 component and the free-mode joint trajectory planner. They both solve the exact same problem, right?

#13 – samcoinc on 2017-10-19

It works pretty well most of the time. In the above test, the only time I can get it to fail is when the input is going negative between the limits (+/-50). It seems as if the logic picks the wrong direction to go. Instead of picking max_out (slowing down and changing direction to stop), it picks min_out in this situation and keeps going in the negative direction. I tried to get it not to fail within that range, thinking that if I could get good data I could see what is going on. It seems to fail consistently, though.
~~~~~
if ( fabs(err+dp*2.0) < fabs(err) ) {
    ramp_a = -ramp_a;
}
if ( ramp_a < 0.0 ) {
    lout = min_out;
    data.old_v = min_v;
} else {
    lout = max_out;
    data.old_v = max_v;
}
~~~~~

#14 – samcoinc on 2017-10-19

no – my above assessment isn’t right. I should shut my mouth. (other than it seems to only fail going negative)

#15 – samcoinc on 2017-10-19

Sorry Seb, I will be less lazy. OK: it actually slows down to a stop (ramps down), but the change of direction doesn't happen (when going negative). The dataset below was logged when the input was disconnected at -43.1456. You can see it decelerated to 0 velocity. Now it should be picking -56.08 (int-max-out) but instead it picks -56.1. It looks like it may have something to do with this code:

    if ( in_v > data.old_v ) { ramp_a = maxa; } else { ramp_a = -maxa; }

In this dataset, in_v and data.old_v are both 0, so ramp_a = -maxa. In this situation that is the wrong choice (when the velocity is moving negative).
So for testing I added these lines:

    /* determine which way we need to ramp to match v */
    if ( in_v > data.old_v ) {
        ramp_a = maxa;
    } else {
        ramp_a = -maxa;
    }
    if (min_v < 0 && max_v > 0 && data.old_out < 0 && lin > min) {
        ramp_a = maxa;
    }

This seems to fix the current problem (at least in my testing so far).
It is a kludge and I am a bit embarrassed, and I cannot figure out a better way of doing it.

[Image: logset]
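
The tie described in this comment can be isolated in a few lines. The sketch below copies only the comparison quoted above into a standalone function; the name `choose_ramp` is hypothetical. It shows that equal velocities always select the negative ramp. It illustrates the asymmetry and is not a diagnosis of the root cause.

```c
#include <assert.h>

/* Mirrors the quoted comparison: a strictly greater input velocity ramps up,
 * anything else (including equality) ramps down. */
static double choose_ramp(double in_v, double old_v, double maxa)
{
    assert(maxa > 0.0);
    if (in_v > old_v)
        return maxa;
    return -maxa;
}
```

Calling `choose_ramp(0.0, 0.0, 10000.0)` returns -10000, so a block that has just come to rest with a stationary input starts from a negative ramp regardless of the direction it was last moving in.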

#16 – samcoinc on 2017-10-19

No not quite right yet. Shocker I know

#17 – samcoinc on 2017-10-19

ok – have a look at this. I think this takes care of at least one corner case.. I have not had it fail yet…

    From 29ec4493096bd1d374f254bdf8c715a7d980a98a Mon Sep 17 00:00:00 2001
    From: Sam Sokolik
    Date: Thu, 19 Oct 2017 17:20:13 -0500
    Subject: Inital fix for limit3 negative run away.

    Signed-off-by: Sam Sokolik
    diff --git a/src/hal/components/limit3.comp b/src/hal/components/limit3.comp
    index 6ff14a7b1..3200f56a0 100644
    --- a/src/hal/components/limit3.comp
    +++ b/src/hal/components/limit3.comp
    @@ -18,6 +18,7 @@ typedef struct {
         double old_in;	/* previous input */
         double old_out;	/* previous output */
         double old_v;	/* previous 1st derivative */
    +    double older_v;     /* Fix for negative run away */
     } limit3_data;
     
     FUNCTION(_) {
    @@ -71,6 +72,10 @@ FUNCTION(_) {
         } else {
             ramp_a = -maxa;
         }
    +    /* An inelegant fix for slow negative runaway */
    +    if (data.old_v == 0 && in_v == 0 && data.older_v < 0 ) {
    +        ramp_a = maxa;
    +    }
         /* determine how long the match would take */
         match_time = ( in_v - data.old_v ) / ramp_a;
         /* where we will be at the end of the match */
    @@ -90,9 +95,11 @@ FUNCTION(_) {
         }
         if ( ramp_a < 0.0 ) {
             lout = min_out;
    +        data.older_v = data.old_v;
             data.old_v = min_v;
         } else {
             lout = max_out;
    +        data.older_v = data.old_v;
             data.old_v = max_v;
         }
     }

#18 – samcoinc on 2017-10-21

Actually, you can make the runaway for all practical purposes never happen if you choose repeating decimal numbers for maxa (which makes me think my theory is correct, as it only happens when old_v and in_v are zero and it was moving in the negative direction).

sam

#19 – samcoinc on 2017-11-09

That did not fix all cases. And setting maxa to a non-repeating decimal seemed only to delay the bug from appearing. I think I give up for now.

#20 – zultron on 2017-11-18

I ended up refactoring the comp in this [branch][1], which passes @SebKuzminsky’s tests with a small fudge.

It’s different from the old comp because it handles the first order max/min limits separately from the input signal. This way it can do its best to keep up with the input, but then do the right thing when it sees the max/min limit coming up.

It turns out to be a bit complicated to decide when to worry about the max/min limits, and when to worry about the input signal, as can be seen in the string of if cases. I did my best to annotate the code for clarity.

It also turns out to be hard to stop exactly on the max/min limits while at the same time observing those for velocity and acceleration. This code overshoots the target by about 0.5%, and then has to find its way back before locking onto the limit.

[Image: limit3-halscope]

[1]: https://github.com/zultron/machinekit/tree/2.7-limit3-issue-240

#21 – zultron on 2017-11-18

See PR at #351

#22 – samcoinc on 2017-11-28

I ran John’s limit3 branch all weekend with no surprises. Great work!

#23 – SebKuzminsky on 2017-12-04

I just merged #351, so this will be fixed in v2.7.11-44-gcdb6603. Thanks to @propcoder for the bug report and @zultron and @samcoinc for the fix and the testing.

Original issue: https://github.com/LinuxCNC/linuxcnc/issues/240
