-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pure-python implementation #55
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nice. At some point we need to separate ray and its dependencies into a separate wheel? Do you want to do that now?
self.params = self.base.params | ||
return is_valid | ||
|
||
|
||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking this main may need to move into python_launcher.py and ray_launcher.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the others as well of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
several comments
:return: 0 - success or 1 - failure | ||
""" | ||
start_ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S") | ||
logger.info(f"orchestrator started at {start_ts}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding a name to TransformConfiguration or Transformer (it can be automatically discovered from the class name) and print it here.
f"orchestrator {name} started at {start_ts}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
logger.error("No DataAccess instance provided - exiting") | ||
return 1 | ||
# Get files to process | ||
files, profile = data_access.get_files_to_process() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have to provide an option to pass data directly from one transformer to another.
plus data_access
should support a specific file access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, no. The results are saved to S3
data_access_factory=data_access_factory, statistics=statistics, params=transform_config | ||
) | ||
# process data | ||
logger.debug("Begin processing files") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, the transformer name here will be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
logger.info(f"Completed {completed} files in {(time.time() - t_start)/60} min") | ||
logger.debug("Done processing files, waiting for flush() completion.") | ||
# invoke flush to ensure that all results are returned | ||
start = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are several start*
variables, I'd rename it here to something like start_flushing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah
"job_output_stats": stats, | ||
} | ||
logger.debug(f"Saved job metadata: {metadata}.") | ||
data_access.save_job_metadata(metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we always assume that there is the output folder? what about processing data from one transformer to another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. we assume that there is input/output
|
||
def _submit_for_execution(self) -> int: | ||
""" | ||
Submit for Ray execution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? submit to Ray execution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
transform_config=self.transform_runtime_config, | ||
) | ||
logger.debug("Completed orchestrator") | ||
time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this sleep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
logger.debug("Completed orchestrator") | ||
time.sleep(10) | ||
except Exception as e: | ||
logger.info(f"Exception running ray remote orchestration\n{e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the message is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
# execute local processing | ||
logger.debug(f"Begin transforming table from {f_name}") | ||
out_tables, stats = self.transform.transform(table=table) | ||
logger.debug(f"Done transforming table from {f_name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add here the length of the out_tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah
try: | ||
# get flush results | ||
logger.debug(f"Begin flushing transform") | ||
out_tables, stats = self.transform.flush() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put t_start = time.time()
before this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, its just debug
|
||
class TransformConfiguration(CLIArgumentProvider): | ||
""" | ||
Provides support the configuration of a transformer running in the ray environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove 'ray'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everywhere :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. Its fortunately in 1 place
cli_prefix = "runtime_" | ||
|
||
|
||
class TransformExecutionConfiguration(CLIArgumentProvider): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch a new class. I guess this needs some doc in the doc directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but without it it will be a lot of duplication of code
@@ -44,7 +44,7 @@ | |||
code_location = {"github": "github", "commit_hash": "12345", "path": "path"} | |||
|
|||
|
|||
class TestLauncher(TransformLauncher): | |||
class TestLauncher(TransformLauncherRay): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a general common about all ray launcher esting. our classes need generalization to not only use/create ray-based launchers and then add cases for the pure python launcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
This is initial crack at pure python runtime. The only tested implementation is noop. Once we agree on the approach. Will do other transforms, that can do local run