[SOAR-18956] Mimecast V2 - Update hash limits #3167
Conversation
Force-pushed from 082b834 to af0c414
@@ -79,15 +109,41 @@ def get_siem_batches(
urls = [batch.get("url") for batch in batch_list]
return urls, batch_response.get("@nextPage"), caught_up

def get_siem_logs_from_batch(self, url: str):
def resume_from_batch(
Have we overcomplicated this ability to resume from the new list of files? We're doing the loop multiple times, with multiple comparisons each time. Could this not just iterate over list_of_batches if we have a saved_url? Once we hit the saved_url match, slice from that index to the end. Later, when we feed the pool_data into get_siem_logs_from_batch, if the file name is the saved_url use that index; otherwise it defaults to zero?
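A minimal sketch of what this suggestion could look like, with hypothetical names (resume_from_batch, list_of_batches, saved_url) standing in for the plugin's actual code:

```python
from typing import List, Optional


def resume_from_batch(list_of_batches: List[str], saved_url: Optional[str]) -> List[str]:
    # No saved URL (or it no longer appears in the batch list): start from the beginning
    if not saved_url or saved_url not in list_of_batches:
        return list_of_batches
    # Otherwise slice from the saved URL's index to the end, in a single pass
    return list_of_batches[list_of_batches.index(saved_url):]
```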
So I have refactored resume_batch and now use functools.partial to spread the saved values into the get_siem_logs_from_batch function, instead of maintaining them in a tuple that resume_batch generates.
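For illustration, a rough sketch of that approach using functools.partial; the names and the thread pool here are placeholders, not the plugin's exact code:

```python
from functools import partial
from multiprocessing.dummy import Pool  # thread pool stand-in for the task's pool
from typing import Dict, List, Tuple


def get_siem_logs_from_batch(url: str, saved_url: str, saved_position: int) -> Tuple[List[Dict], str]:
    # Only resume mid-file for the batch we previously stopped in
    line_start = saved_position if url == saved_url else 0
    logs: List[Dict] = []
    # ... download `url`, skip to `line_start`, parse the remaining lines into `logs` ...
    return logs, url


def run_batches(urls: List[str], saved_url: str, saved_position: int):
    # partial binds the saved state once, so the workers receive plain URLs
    batch_fn = partial(get_siem_logs_from_batch, saved_url=saved_url, saved_position=saved_position)
    with Pool(processes=4) as pool:
        return pool.map(batch_fn, urls)
```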
return pool_data

def get_siem_logs_from_batch(self, url_and_position: Tuple[str, int]) -> Tuple[List[Dict], str]:
    url, line_start = url_and_position
If we simplify the resume_from_batch logic, this could be along the lines of:
def get_siem_logs_from_batch(self, url, starting_url, starting_position):
    starting_position = starting_position if url == starting_url else 1
    <rest of logic can stay the same>
So I have refactored resume_batch and now use functools.partial to spread the saved values into the get_siem_logs_from_batch function, instead of maintaining them in a tuple that resume_batch generates.
return pool_data

def get_siem_logs_from_batch(self, url_and_position: Tuple[str, int]) -> Tuple[List[Dict], str]:
    url, line_start = url_and_position
    response = requests.request(method=GET, url=url, stream=False)
Out of interest, does stream=True work for this endpoint? When I was looking into this before, when it does work the API usually returns a Content-Length header, which would tell us if the file is going to exceed our content limit.
It does, and we can see our content length in bytes, though I'm not sure how immediately useful this is. I'm thinking we would have to know the compression ratio and the average size of each log JSON?
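As a hedged sketch, this is roughly how a stream=True request could be used to peek at Content-Length before reading the body; the limit value and URL are placeholders, and the header only reflects compressed bytes:

```python
import requests

MAX_BATCH_BYTES = 50 * 1024 * 1024  # illustrative limit, not a documented Mimecast value
url = "https://example.com/siem-batch.json.gz"  # placeholder batch URL

response = requests.request(method="GET", url=url, stream=True)
content_length = int(response.headers.get("Content-Length", 0))
if content_length and content_length > MAX_BATCH_BYTES:
    # Compressed size only: estimating decompressed log counts would still need an
    # assumed compression ratio and average JSON line size, as noted above.
    response.close()
else:
    data = response.content  # the body is only downloaded here, once we decide to read it
```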
for batch_logs, url in result:
    if isinstance(batch_logs, (List, Dict)):
        with lock:
            total_count.value = total_count.value + len(batch_logs)
There are quite a few repeated len() calls and calculations here; could we simplify it?
Suggested change (replacing total_count.value = total_count.value + len(batch_logs)):
total_batch_logs = len(batch_logs)
total_count.value = total_count.value + total_batch_logs
if total_count.value >= log_size_limit:
    leftover_logs_count = total_count.value - log_size_limit
    saved_position = total_batch_logs - leftover_logs_count
    batch_logs = batch_logs[0:saved_position]
logs.extend(batch_logs)
<contd>
Using this should hopefully help memory a tad as well, since we're not making new variables but slicing the current one.
Have added that in to remove a length calculation. I think the rest is fine; we only really make the calculations necessary to get our counts. The subsection of logs could be done with negative slicing, but we would then also have to do a min check for batches that return only one log.
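Restated as a standalone helper (illustrative only, omitting the shared multiprocessing counter and lock used in the task), the trimming logic above amounts to:

```python
from typing import Dict, List


def extend_up_to_limit(logs: List[Dict], batch_logs: List[Dict], total_count: int, log_size_limit: int) -> int:
    total_batch_logs = len(batch_logs)
    total_count += total_batch_logs
    if total_count >= log_size_limit:
        # Drop the overflow from the end of this batch; saved_position marks where to resume next run
        leftover_logs_count = total_count - log_size_limit
        saved_position = total_batch_logs - leftover_logs_count
        batch_logs = batch_logs[0:saved_position]
    logs.extend(batch_logs)
    return total_count
```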
response = requests.request(method=GET, url=url, stream=False)
with gzip.GzipFile(fileobj=BytesIO(response.content), mode="rb") as file_:
    logs = []
    # Iterate over lines in the decompressed file, decode and load the JSON
    for line in file_:
    for _, line in enumerate(file_, start=line_start):
        decoded_line = line.decode("utf-8").strip()
        logs.append(json.loads(decoded_line))
I'm just remembering that we had issues on v1 where some files could contain malformed JSON and we then got stuck in a loop. Should we allow for this here again and continue to the next file?
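A minimal sketch of that guard, assuming the same decode-and-load loop shown above (helper name is hypothetical):

```python
import json


def parse_decompressed_lines(file_) -> list:
    logs = []
    for line in file_:
        decoded_line = line.decode("utf-8").strip()
        try:
            logs.append(json.loads(decoded_line))
        except json.JSONDecodeError:
            # Skip the malformed line rather than failing (or looping on) the whole batch
            continue
    return logs
```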
Force-pushed from af0c414 to 2c81ddd (…n decode error handling and unit test)
logs.append(json.loads(decoded_line))
return logs
try:
    logs.append(json.loads(decoded_line))
Honestly, I'm a bit wary that we could be opening huge gzip files here and loading the entire content into memory. Do we know if there's a limit on the files from Mimecast, or is it worth looking at how we used docker stats in the past? Although it does cause issues with catching the exception, and I might be being overcautious.
Updated to now use chunking, which will keep us from storing the whole response in memory.
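As a rough sketch of that idea (not the plugin's exact implementation), a chunked download can be fed through a streaming gzip decompressor so the full response body is never held in memory; the chunk size and helper name are assumptions:

```python
import json
import zlib

import requests


def stream_batch_logs(url: str):
    logs = []
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # accept gzip framing
    buffer = b""
    with requests.get(url, stream=True) as response:
        for chunk in response.iter_content(chunk_size=64 * 1024):
            buffer += decompressor.decompress(chunk)
            # Emit complete lines; keep any trailing partial line in the buffer
            while b"\n" in buffer:
                line, buffer = buffer.split(b"\n", 1)
                if line.strip():
                    logs.append(json.loads(line.decode("utf-8")))
    # Flush whatever remains after the final chunk
    if buffer.strip():
        logs.append(json.loads(buffer.decode("utf-8")))
    return logs
```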
Proposed Changes
Description
Describe the proposed changes:
PR Requirements
Developers, verify you have completed the following items by checking them off:
Testing
Unit Tests
Review our documentation on generating and writing plugin unit tests
In-Product Tests
If you are an InsightConnect customer or have access to an InsightConnect instance, the following in-product tests should be done:
Style
Review the style guide
- USER nobody in the Dockerfile when possible
- rapid7/insightconnect-python-3-38-slim-plugin:{sdk-version-num} and rapid7/insightconnect-python-3-38-plugin:{sdk-version-num}
- insight-plugin validate, which calls icon_validate to lint help.md
Functional Checklist
- tests/ directory created with insight-plugin samples
- tests/$action_bad.json
- insight-plugin run -T tests/example.json --debug --jq
- insight-plugin run -T all --debug --jq (use PR format at end)
- insight-plugin run -R tests/example.json --debug --jq
- insight-plugin run --debug --jq (use PR format at end)
Assessment
You must validate your work to reviewers:
- insight-plugin validate and make sure everything passes
- insight-plugin run -A. For single action validation: insight-plugin run tests/{file}.json -A
- insight-plugin ... | pbcopy, and paste the output in a new post on this PR