Getting more information when a StackOverflowError occurs #161

Vlatombe · 2025-02-13T11:32:17Z

Hello 👋 ,

I work on the Jenkins project, and I'm investigating issues we're seeing while using this library to marshal and unmarshal a pipeline runtime state.

When marshalling or unmarshalling a very deep object structure, this can raise StackOverflowError depending on the configured JVM stack size.

Unfortunately, the current implementation only catches various exceptions, not StackOverflowError. In addition, when catching these the existing mecanism to attach TraceInformation doesn't work.

StackOverflowError doesn't support adding a cause (the current mechanism used by TraceInformation)
Adding TraceInformation as suppressed exception can itself lead to a new StackOverflowError.

These experimental changes do 2 things:

Use suppressed exceptions instead of cause to attach TraceInformation.
Wrap StackOverflowError into a RuntimeException. That way, the TraceInformation object can be attached as soon as the catching code goes up one level.

I don't expect this to be merged at all, it is more of a way to open discussion.

While looking at the project, I was also a bit confused:

The readme states issues should be reported using GH issues, however they are disabled for this repository.
The contributing document states issues are managed using RedHat Jira.
So which is it?

When unmarshalling a very deep object structure, this can raise StackOverflowError. There are multiple issues with the existing implementation, preventing attachment of TraceInformation to such object. * StackOverflowError doesn't support adding a cause (the current mechanism used by TraceInformation) * Adding TraceInformation as suppressed exception can itself lead to a new StackOverflowError.

jglick

Looks sensible. Note that this project has accepted diagnostic patches motivated by Jenkins Pipeline usage in the past: 83738e5

jglick · 2025-02-13T13:49:02Z

api/src/main/java/org/jboss/marshalling/TraceInformation.java

+        TraceInformation ti;
+        if (optionalTi.isEmpty()) {
+            ti = new TraceInformation();
+            t.addSuppressed(ti);


Seems sensible. Did you notice if there was any test coverage for this facility?

jglick · 2025-02-13T13:49:53Z

river/src/main/java/org/jboss/marshalling/river/RiverUnmarshaller.java

@@ -207,6 +207,8 @@ Object doReadNestedObject(final boolean unshared, final String enclosingClassNam
        } catch (RuntimeException e) {
            TraceInformation.addIncompleteObjectInformation(e, enclosingClassName);
            throw e;
+        } catch (StackOverflowError e) {
+            throw new RuntimeException(e);


Should we be calling TraceInformation.addIncompleteObjectInformation here too? I suppose it would be added by the next stack frame up?

dmlloyd · 2025-02-13T14:51:50Z

Hello 👋 ,

I work on the Jenkins project, and I'm investigating issues we're seeing while using this library to marshal and unmarshal a pipeline runtime state.

When marshalling or unmarshalling a very deep object structure, this can raise StackOverflowError depending on the configured JVM stack size.

Unfortunately, the current implementation only catches various exceptions, not StackOverflowError. In addition, when catching these the existing mecanism to attach TraceInformation doesn't work.

StackOverflowError doesn't support adding a cause (the current mechanism used by TraceInformation)

You can, actually, by using initCause:

$ jshell
|  Welcome to JShell -- Version 23
|  For an introduction type: /help intro

jshell> new StackOverflowError()
$1 ==> java.lang.StackOverflowError

jshell> $1.initCause(new RuntimeException())
$2 ==> java.lang.StackOverflowError

jshell> $2.getCause()
$3 ==> java.lang.RuntimeException

Adding TraceInformation as suppressed exception can itself lead to a new StackOverflowError.

These experimental changes do 2 things:

Use suppressed exceptions instead of cause to attach TraceInformation.

Wrap StackOverflowError into a RuntimeException. That way, the TraceInformation object can be attached as soon as the catching code goes up one level.

I guess "going up one level" is to avoid a recursive stack overflow?

I wonder if it would be a good idea to add a simple object-depth limiter. We already track the depth for the purposes of filtering, so it shouldn't be too hard to add a check to detect a maximum object depth violation and throw a RuntimeException rather than waiting for the JVM to die.

I don't expect this to be merged at all, it is more of a way to open discussion.

While looking at the project, I was also a bit confused:

The readme states issues should be reported using GH issues, however they are disabled for this repository.

The contributing document states issues are managed using RedHat Jira.
So which is it?

It's currently JIRA, but with the intention of migrating to GitHub. I think the README probably just got copy-pasted though; sorry about that. Maybe this is the prompting I need to get the migration going.

Vlatombe · 2025-02-13T15:54:52Z

You can, actually, by using initCause

Ah, I may have been confused by the recursive StackOverflowError.

I guess "going up one level" is to avoid a recursive stack overflow?

You guess right.

a check to detect a maximum object depth violation

Not sure what would be an appropriate value. In Jenkins case, the object tree being marshalled/unmarshalled is derived from a user input. The problem right now with the SOE is that we are unable to collect TraceInformation that would allow us to diagnose what part of the user input causes excessive depth into the resulting tree. If such a value could be derived from the actual -Xss with some headroom for diagnostic, this would be perfect, but from what I could look up it doesn't seem to be possible to retrieve this value programmatically.

Vlatombe · 2025-02-13T16:30:25Z

You can, actually, by using initCause

In my debug session, cause for StackOverflowError is set to null, which prevents a further call to initCause.

I think this is because the JVM doesn't use the constructor to create StackOverflowError instances (https://github.com/openjdk/jdk21u-dev/blob/950e655064a75e20540955ad91430c8bea7ae73b/src/hotspot/share/runtime/sharedRuntime.cpp#L817-L838)

dmlloyd · 2025-02-13T16:32:17Z

You can, actually, by using initCause

Ah, I may have been confused by the recursive StackOverflowError.

I guess "going up one level" is to avoid a recursive stack overflow?

You guess right.

a check to detect a maximum object depth violation

Not sure what would be an appropriate value. In Jenkins case, the object tree being marshalled/unmarshalled is derived from a user input. The problem right now with the SOE is that we are unable to collect TraceInformation that would allow us to diagnose what part of the user input causes excessive depth into the resulting tree. If such a value could be derived from the actual -Xss with some headroom for diagnostic, this would be perfect, but from what I could look up it doesn't seem to be possible to retrieve this value programmatically.

Right, and even if one could do so, it would be highly JVM- and platform-specific as to how the stack size in bytes translates to call frames and thus to allowable recursion depth. It's too bad they don't even let you access something like a percentage like double space = Thread.stackSpaceRemaining() or something. Then we could fail at (say) 80% of maximum.

This is not really a solution, or even directly related, but it's worth mentioning on this thread that you can independently establish a stack size on a per-thread basis if you construct the Thread instance yourself. I use this trick in another project which uses a very complex recursion-heavy algorithm to analyze a large data structure and it works well. And once your processing is finished, the thread can exit and immediately reclaim that memory, which is nice.

dmlloyd · 2025-02-13T16:39:17Z

You can, actually, by using initCause

Ah, I may have been confused by the recursive StackOverflowError.

I guess "going up one level" is to avoid a recursive stack overflow?

You guess right.

a check to detect a maximum object depth violation

Not sure what would be an appropriate value. In Jenkins case, the object tree being marshalled/unmarshalled is derived from a user input. The problem right now with the SOE is that we are unable to collect TraceInformation that would allow us to diagnose what part of the user input causes excessive depth into the resulting tree. If such a value could be derived from the actual -Xss with some headroom for diagnostic, this would be perfect, but from what I could look up it doesn't seem to be possible to retrieve this value programmatically.

I think you could find it using a bisect methodology; start with, say, 100. If you hit the limit, it's too low; if you hit SOE, it's too high; if you hit both then your stack is too small.

Limiting depth is also useful for preventing DoS attacks when the payload might somehow be derived from an untrusted source.

jglick · 2025-02-25T00:52:57Z

you can independently establish a stack size on a per-thread basis

Thanks for the tip. It does work in a test environment (jenkinsci/workflow-cps-plugin#993) though it does not look great for production code.

dmlloyd · 2025-02-25T14:57:37Z

you can independently establish a stack size on a per-thread basis

Thanks for the tip. It does work in a test environment (jenkinsci/workflow-cps-plugin#993) though it does not look great for production code.

Yeah, it's a little ugly. :-)

jglick approved these changes Feb 13, 2025

View reviewed changes

jglick mentioned this pull request Feb 13, 2025

Batch diagnostic tool for StackOverflowErrors jenkinsci/workflow-support-plugin#305

Closed

jglick mentioned this pull request Feb 25, 2025

More informative ObjectInfo #163

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting more information when a StackOverflowError occurs #161

Getting more information when a StackOverflowError occurs #161

Vlatombe commented Feb 13, 2025

jglick left a comment

jglick Feb 13, 2025

jglick Feb 13, 2025

dmlloyd commented Feb 13, 2025

Vlatombe commented Feb 13, 2025

Vlatombe commented Feb 13, 2025

dmlloyd commented Feb 13, 2025

dmlloyd commented Feb 13, 2025

jglick commented Feb 25, 2025

dmlloyd commented Feb 25, 2025

Getting more information when a StackOverflowError occurs #161

Are you sure you want to change the base?

Getting more information when a StackOverflowError occurs #161

Conversation

Vlatombe commented Feb 13, 2025

jglick left a comment

Choose a reason for hiding this comment

jglick Feb 13, 2025

Choose a reason for hiding this comment

jglick Feb 13, 2025

Choose a reason for hiding this comment

dmlloyd commented Feb 13, 2025

Vlatombe commented Feb 13, 2025

Vlatombe commented Feb 13, 2025

dmlloyd commented Feb 13, 2025

dmlloyd commented Feb 13, 2025

jglick commented Feb 25, 2025

dmlloyd commented Feb 25, 2025