Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting more information when a StackOverflowError occurs #161

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Vlatombe
Copy link

Hello 👋 ,

I work on the Jenkins project, and I'm investigating issues we're seeing while using this library to marshal and unmarshal a pipeline runtime state.

When marshalling or unmarshalling a very deep object structure, this can raise StackOverflowError depending on the configured JVM stack size.

Unfortunately, the current implementation only catches various exceptions, not StackOverflowError. In addition, when catching these the existing mecanism to attach TraceInformation doesn't work.

  • StackOverflowError doesn't support adding a cause (the current mechanism used by TraceInformation)
  • Adding TraceInformation as suppressed exception can itself lead to a new StackOverflowError.

These experimental changes do 2 things:

  • Use suppressed exceptions instead of cause to attach TraceInformation.
  • Wrap StackOverflowError into a RuntimeException. That way, the TraceInformation object can be attached as soon as the catching code goes up one level.

I don't expect this to be merged at all, it is more of a way to open discussion.

While looking at the project, I was also a bit confused:

  • The readme states issues should be reported using GH issues, however they are disabled for this repository.
  • The contributing document states issues are managed using RedHat Jira.
    So which is it?

When unmarshalling a very deep object structure, this can raise StackOverflowError.

There are multiple issues with the existing implementation, preventing attachment of TraceInformation to such object.

* StackOverflowError doesn't support adding a cause (the current mechanism used by TraceInformation)
* Adding TraceInformation as suppressed exception can itself lead to a new StackOverflowError.
Copy link
Contributor

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible. Note that this project has accepted diagnostic patches motivated by Jenkins Pipeline usage in the past: 83738e5

TraceInformation ti;
if (optionalTi.isEmpty()) {
ti = new TraceInformation();
t.addSuppressed(ti);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems sensible. Did you notice if there was any test coverage for this facility?

@@ -207,6 +207,8 @@ Object doReadNestedObject(final boolean unshared, final String enclosingClassNam
} catch (RuntimeException e) {
TraceInformation.addIncompleteObjectInformation(e, enclosingClassName);
throw e;
} catch (StackOverflowError e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be calling TraceInformation.addIncompleteObjectInformation here too? I suppose it would be added by the next stack frame up?

@dmlloyd
Copy link
Member

dmlloyd commented Feb 13, 2025

Hello 👋 ,

I work on the Jenkins project, and I'm investigating issues we're seeing while using this library to marshal and unmarshal a pipeline runtime state.

When marshalling or unmarshalling a very deep object structure, this can raise StackOverflowError depending on the configured JVM stack size.

Unfortunately, the current implementation only catches various exceptions, not StackOverflowError. In addition, when catching these the existing mecanism to attach TraceInformation doesn't work.

  • StackOverflowError doesn't support adding a cause (the current mechanism used by TraceInformation)

You can, actually, by using initCause:

$ jshell
|  Welcome to JShell -- Version 23
|  For an introduction type: /help intro

jshell> new StackOverflowError()
$1 ==> java.lang.StackOverflowError

jshell> $1.initCause(new RuntimeException())
$2 ==> java.lang.StackOverflowError

jshell> $2.getCause()
$3 ==> java.lang.RuntimeException
  • Adding TraceInformation as suppressed exception can itself lead to a new StackOverflowError.

These experimental changes do 2 things:

  • Use suppressed exceptions instead of cause to attach TraceInformation.
  • Wrap StackOverflowError into a RuntimeException. That way, the TraceInformation object can be attached as soon as the catching code goes up one level.

I guess "going up one level" is to avoid a recursive stack overflow?

I wonder if it would be a good idea to add a simple object-depth limiter. We already track the depth for the purposes of filtering, so it shouldn't be too hard to add a check to detect a maximum object depth violation and throw a RuntimeException rather than waiting for the JVM to die.

I don't expect this to be merged at all, it is more of a way to open discussion.

While looking at the project, I was also a bit confused:

  • The readme states issues should be reported using GH issues, however they are disabled for this repository.
  • The contributing document states issues are managed using RedHat Jira.
    So which is it?

It's currently JIRA, but with the intention of migrating to GitHub. I think the README probably just got copy-pasted though; sorry about that. Maybe this is the prompting I need to get the migration going.

@Vlatombe
Copy link
Author

You can, actually, by using initCause

Ah, I may have been confused by the recursive StackOverflowError.

I guess "going up one level" is to avoid a recursive stack overflow?

You guess right.

a check to detect a maximum object depth violation

Not sure what would be an appropriate value. In Jenkins case, the object tree being marshalled/unmarshalled is derived from a user input. The problem right now with the SOE is that we are unable to collect TraceInformation that would allow us to diagnose what part of the user input causes excessive depth into the resulting tree. If such a value could be derived from the actual -Xss with some headroom for diagnostic, this would be perfect, but from what I could look up it doesn't seem to be possible to retrieve this value programmatically.

@Vlatombe
Copy link
Author

You can, actually, by using initCause

In my debug session, cause for StackOverflowError is set to null, which prevents a further call to initCause.

I think this is because the JVM doesn't use the constructor to create StackOverflowError instances (https://github.com/openjdk/jdk21u-dev/blob/950e655064a75e20540955ad91430c8bea7ae73b/src/hotspot/share/runtime/sharedRuntime.cpp#L817-L838)

@dmlloyd
Copy link
Member

dmlloyd commented Feb 13, 2025

You can, actually, by using initCause

Ah, I may have been confused by the recursive StackOverflowError.

I guess "going up one level" is to avoid a recursive stack overflow?

You guess right.

a check to detect a maximum object depth violation

Not sure what would be an appropriate value. In Jenkins case, the object tree being marshalled/unmarshalled is derived from a user input. The problem right now with the SOE is that we are unable to collect TraceInformation that would allow us to diagnose what part of the user input causes excessive depth into the resulting tree. If such a value could be derived from the actual -Xss with some headroom for diagnostic, this would be perfect, but from what I could look up it doesn't seem to be possible to retrieve this value programmatically.

Right, and even if one could do so, it would be highly JVM- and platform-specific as to how the stack size in bytes translates to call frames and thus to allowable recursion depth. It's too bad they don't even let you access something like a percentage like double space = Thread.stackSpaceRemaining() or something. Then we could fail at (say) 80% of maximum.

This is not really a solution, or even directly related, but it's worth mentioning on this thread that you can independently establish a stack size on a per-thread basis if you construct the Thread instance yourself. I use this trick in another project which uses a very complex recursion-heavy algorithm to analyze a large data structure and it works well. And once your processing is finished, the thread can exit and immediately reclaim that memory, which is nice.

@dmlloyd
Copy link
Member

dmlloyd commented Feb 13, 2025

You can, actually, by using initCause

Ah, I may have been confused by the recursive StackOverflowError.

I guess "going up one level" is to avoid a recursive stack overflow?

You guess right.

a check to detect a maximum object depth violation

Not sure what would be an appropriate value. In Jenkins case, the object tree being marshalled/unmarshalled is derived from a user input. The problem right now with the SOE is that we are unable to collect TraceInformation that would allow us to diagnose what part of the user input causes excessive depth into the resulting tree. If such a value could be derived from the actual -Xss with some headroom for diagnostic, this would be perfect, but from what I could look up it doesn't seem to be possible to retrieve this value programmatically.

I think you could find it using a bisect methodology; start with, say, 100. If you hit the limit, it's too low; if you hit SOE, it's too high; if you hit both then your stack is too small.

Limiting depth is also useful for preventing DoS attacks when the payload might somehow be derived from an untrusted source.

@jglick
Copy link
Contributor

jglick commented Feb 25, 2025

you can independently establish a stack size on a per-thread basis

Thanks for the tip. It does work in a test environment (jenkinsci/workflow-cps-plugin#993) though it does not look great for production code.

@dmlloyd
Copy link
Member

dmlloyd commented Feb 25, 2025

you can independently establish a stack size on a per-thread basis

Thanks for the tip. It does work in a test environment (jenkinsci/workflow-cps-plugin#993) though it does not look great for production code.

Yeah, it's a little ugly. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants