
[SOAR-18956] Mimecast V2 - Update hash limits #3167

Merged
14 commits merged into develop on Feb 26, 2025

Conversation


@ablakley-r7 (Collaborator) commented Feb 21, 2025

Proposed Changes

Description

Describe the proposed changes:

  • Update hash limits for state to reduce state size
  • Update log limit to 7500 for all logs
  • Context: SOAR-18956

PR Requirements

Developers, verify you have completed the following items by checking them off:

Testing

Unit Tests

Review our documentation on generating and writing plugin unit tests

  • Unit tests written for any new or updated code

In-Product Tests

If you are an InsightConnect customer or have access to an InsightConnect instance, the following in-product tests should be done:

  • Screenshot of job output with the plugin changes
  • Screenshot of the changed connection, actions, or triggers input within the InsightConnect workflow builder

Style

Review the style guide

  • For dependencies, pin OS package and Python package versions
  • For security, set least privileged account with USER nobody in the Dockerfile when possible
  • For size, use the slim SDK images when possible: rapid7/insightconnect-python-3-38-slim-plugin:{sdk-version-num} and rapid7/insightconnect-python-3-38-plugin:{sdk-version-num}
  • For error handling, use of PluginException and ConnectionTestException
  • For logging, use self.logger
  • For docs, use changelog style
  • For docs, validate markdown with insight-plugin validate which calls icon_validate to lint help.md

Functional Checklist

  • Work fully completed
  • Functional
    • Any new actions/triggers include JSON test files in the tests/ directory created with insight-plugin samples
    • Tests should all pass unless it's a negative test. Negative tests have a naming convention of tests/$action_bad.json
    • Unsuccessful tests should fail by raising an exception (causing the plugin to die); successful tests should return an object
    • Add functioning test results to PR, sanitize any output if necessary
      • Single action/trigger insight-plugin run -T tests/example.json --debug --jq
      • All actions/triggers shortcut insight-plugin run -T all --debug --jq (use PR format at end)
    • Add functioning run results to PR, sanitize any output if necessary
      • Single action/trigger insight-plugin run -R tests/example.json --debug --jq
      • All actions/triggers shortcut insight-plugin run --debug --jq (use PR format at end)

Assessment

You must validate your work to reviewers:

  1. Run insight-plugin validate and make sure everything passes
  2. Run the assessment tool: insight-plugin run -A. For single action validation: insight-plugin run tests/{file}.json -A
  3. Copy (insight-plugin ... | pbcopy) and paste the output in a new post on this PR
  4. Add required screenshots from the In-Product Tests section

@ablakley-r7 requested a review from a team as a code owner on February 21, 2025 09:59
@@ -79,15 +109,41 @@ def get_siem_batches(
         urls = [batch.get("url") for batch in batch_list]
         return urls, batch_response.get("@nextPage"), caught_up

-    def get_siem_logs_from_batch(self, url: str):
+    def resume_from_batch(
Collaborator

Have we overcomplicated this ability to resume from the new list of files? We're doing the loop multiple times, with multiple comparisons both times. Could this not just iterate over list_of_batches if we have a saved_url? Once we hit the saved_url match, slice from that index to the end. Later, when we feed pool_data into get_siem_logs_from_batch, if the file name is saved_url use that index; otherwise it defaults to zero.
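The reviewer's slice-from-saved-index idea could be sketched roughly like this. The names list_of_batches and saved_url come from the comment; the helper itself is hypothetical, not the plugin's actual code:

```python
def resume_slice(list_of_batches, saved_url):
    # If we stopped partway through a previous run, drop every batch
    # before the saved URL and keep it plus everything after it.
    if saved_url in list_of_batches:
        return list_of_batches[list_of_batches.index(saved_url):]
    # Saved URL not found (e.g. first run): process the full list.
    return list_of_batches
```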

Collaborator Author

I've refactored resume_batch and now use partial to spread the saved values into the get_siem_logs_from_batch function, instead of maintaining them in a tuple that resume_batch generates.
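The functools.partial approach described here might look roughly like the following. The function body and names are illustrative stand-ins, not the plugin's actual code:

```python
from functools import partial

def get_siem_logs_from_batch(url, saved_url=None, saved_position=0):
    # Only the batch we stopped in mid-file resumes from saved_position;
    # every other batch starts from the beginning.
    start = saved_position if url == saved_url else 0
    return f"fetching {url} from line {start}"

urls = ["https://example.com/a.gz", "https://example.com/b.gz"]
# Bind the saved resume point once, instead of packing (url, position)
# tuples for every batch; the pool then only maps over plain URLs.
worker = partial(
    get_siem_logs_from_batch,
    saved_url="https://example.com/b.gz",
    saved_position=42,
)
results = [worker(url) for url in urls]
```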

        return pool_data

    def get_siem_logs_from_batch(self, url_and_position: Tuple[str, int]) -> Tuple[List[Dict], str]:
        url, line_start = url_and_position
Collaborator

If we simplify the resume_from_batch logic, this could be along the lines of:

    def get_siem_logs_from_batch(self, url, starting_url, starting_position):
        starting_position = starting_position if url == starting_url else 1
        <rest of logic can stay the same>

Collaborator Author

I've refactored resume_batch and now use partial to spread the saved values into the get_siem_logs_from_batch function, instead of maintaining them in a tuple that resume_batch generates.

        return pool_data

    def get_siem_logs_from_batch(self, url_and_position: Tuple[str, int]) -> Tuple[List[Dict], str]:
        url, line_start = url_and_position
        response = requests.request(method=GET, url=url, stream=False)
Collaborator

Out of interest, does stream=True work for this endpoint? When I looked into this before, stream support usually means the API returns a Content-Length, which would tell us whether the file is going to exceed our content limit.

Collaborator Author

It does, and we can see the content length in bytes, though I'm not sure how immediately useful that is. I'm thinking we would have to know the compression ratio and the average size of each log JSON?
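A minimal sketch of the Content-Length pre-check being discussed: with requests this would pair with stream=True so the body isn't downloaded before the check. The limit value and helper name here are assumptions, not the plugin's actual API:

```python
MAX_COMPRESSED_BYTES = 50 * 1024 * 1024  # illustrative limit, not the real one

def within_size_limit(headers: dict, limit: int = MAX_COMPRESSED_BYTES) -> bool:
    # Content-Length is the *compressed* size; as the thread notes, we
    # would still need a compression-ratio estimate to bound log counts.
    content_length = headers.get("Content-Length")
    if content_length is None:
        return True  # header absent: cannot pre-check, proceed and guard later
    return int(content_length) <= limit
```

In practice this would consume `response.headers` from a `requests.get(url, stream=True)` call before reading the body.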

        for batch_logs, url in result:
            if isinstance(batch_logs, (List, Dict)):
                with lock:
                    total_count.value = total_count.value + len(batch_logs)
Collaborator

There are quite a few repeated len calls and calculations here; could we simplify it?

Suggested change

    -    total_count.value = total_count.value + len(batch_logs)
    +    total_batch_logs = len(batch_logs)
    +    total_count.value = total_count.value + total_batch_logs
    +    if total_count.value >= log_size_limit:
    +        leftover_logs_count = total_count.value - log_size_limit
    +        saved_position = total_batch_logs - leftover_logs_count
    +        batch_logs = batch_logs[0:saved_position]
    +    logs.extend(batch_logs)
    <contd>

Using this should hopefully help memory a tad as well, since we're not making new variables but slicing and dicing the current one.

Collaborator Author

Have added that in to remove a length calculation. I think the rest is fine; we only make the calculations necessary to get our counts. The subsection of logs could be done with negative slicing, but then we would also need a min check for batches that return only one log.
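The counting-and-trimming logic discussed in this thread, stripped of the shared-memory counter and lock, could be distilled into something like this sketch (log_size_limit comes from the diff; the function itself is illustrative):

```python
def trim_to_limit(batches, log_size_limit):
    # Accumulate logs batch by batch, cutting the final batch short so
    # the total never exceeds log_size_limit.
    logs = []
    total = 0
    for batch_logs in batches:
        total += len(batch_logs)
        if total >= log_size_limit:
            leftover = total - log_size_limit
            # Keep only enough of this batch to reach the limit exactly.
            logs.extend(batch_logs[: len(batch_logs) - leftover])
            break
        logs.extend(batch_logs)
    return logs
```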

        response = requests.request(method=GET, url=url, stream=False)
        with gzip.GzipFile(fileobj=BytesIO(response.content), mode="rb") as file_:
            logs = []
            # Iterate over lines in the decompressed file, decode and load the JSON
-           for line in file_:
+           for _, line in enumerate(file_, start=line_start):
                decoded_line = line.decode("utf-8").strip()
                logs.append(json.loads(decoded_line))
Collaborator

I'm just remembering we had issues on v1 where some files could contain malformed JSON and we got stuck in a loop; should we allow for this in here again and continue to the next file?

@ablakley-r7 force-pushed the soar-18956_mimecast_v2 branch from af0c414 to 2c81ddd on February 25, 2025 17:23
-                logs.append(json.loads(decoded_line))
-        return logs
+                try:
+                    logs.append(json.loads(decoded_line))
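The malformed-JSON guard being added here, in isolation, amounts to something like the following sketch (the helper name is hypothetical; the plugin's real code iterates a decompressed gzip file object):

```python
import json

def parse_lines(lines):
    # Skip malformed records instead of letting one bad line raise and
    # stall the whole batch, as happened on v1.
    logs = []
    for raw in lines:
        decoded = raw.decode("utf-8").strip()
        try:
            logs.append(json.loads(decoded))
        except json.JSONDecodeError:
            continue  # malformed record: move on to the next line
    return logs
```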
Collaborator

Honestly, I'm a bit wary that we could be opening huge gzip files here and loading the entire content into memory. Do we know if there's a limit on the files from Mimecast, or is it worth looking at how we used docker stats in the past? It does cause issues with catching the exception, though, and I might be being overcautious.

Collaborator Author

Updated to now use chunking, which keeps us from storing the full response in memory.
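One hedged way to implement that chunked approach: stream the response (for example via `response.iter_content`) and decompress incrementally with zlib, so neither the full gzip body nor the full decompressed file sits in memory at once. The function below is a sketch under those assumptions, not the plugin's actual code:

```python
import zlib

def iter_decompressed_lines(chunks):
    # wbits=47 (32 + 15) lets zlib auto-detect the gzip header.
    decompressor = zlib.decompressobj(wbits=47)
    buffer = b""
    for chunk in chunks:
        buffer += decompressor.decompress(chunk)
        # Yield complete newline-terminated records as they appear.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line
    # Flush any remaining decompressed bytes (last line may lack "\n").
    buffer += decompressor.flush()
    if buffer:
        yield buffer
```

Each yielded line can then be decoded and passed to `json.loads` exactly as before, keeping peak memory proportional to one chunk plus one record.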

@ablakley-r7 merged commit 6bb44ac into develop on Feb 26, 2025
12 checks passed