Commit 600f911

Merge pull request #14 from DistributedScience/manuscript_improvements
Manuscript improvements
2 parents 6580a81 + 838e2fd commit 600f911

19 files changed

+302
-10
lines changed

README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,11 @@
11
# Distributed-Something
22
Run encapsulated docker containers that do... something in the Amazon Web Services (AWS) infrastructure.
3-
We are interested in scientific image analysis so we have used it for [CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler), [Fiji](https://github.com/CellProfiler/Distributed-Fiji), and [BioFormats2Raw](https://github.com/CellProfiler/Distributed-OmeZarrMaker).
3+
We are interested in scientific image analysis so we have used it for [CellProfiler](https://github.com/DistributedScience/Distributed-CellProfiler), [Fiji](https://github.com/DistributedScience/Distributed-Fiji), and [BioFormats2Raw](https://github.com/DistributedScience/Distributed-OmeZarrCreator).
44
You can use it for whatever you want!
55

66
## Documentation
77
Full documentation is available on our [Documentation Website](https://distributedscience.github.io/Distributed-Something).
8+
We have a fully-functional minimal example of a Distributed-Something application available at [Distributed-HelloWorld](https://github.com/DistributedScience/Distributed-HelloWorld).
89

910
## Overview
1011

config.py

Lines changed: 5 additions & 1 deletion
@@ -28,11 +28,15 @@
2828
# SQS QUEUE INFORMATION:
2929
SQS_QUEUE_NAME = APP_NAME + 'Queue'
3030
SQS_MESSAGE_VISIBILITY = 1*60 # Timeout (secs) for messages in flight (average time to be processed)
31-
SQS_DEAD_LETTER_QUEUE = 'arn:aws:sqs:some-region:111111100000:DeadMessages'
31+
SQS_DEAD_LETTER_QUEUE = 'user_DeadMessages'
3232

3333
# LOG GROUP INFORMATION:
3434
LOG_GROUP_NAME = APP_NAME
3535

36+
# CLOUDWATCH DASHBOARD CREATION
37+
CREATE_DASHBOARD = 'True' # Create a dashboard in Cloudwatch for run
38+
CLEAN_DASHBOARD = 'True' # Automatically remove dashboard at end of run with Monitor
39+
3640
# REDUNDANCY CHECKS
3741
CHECK_IF_DONE_BOOL = 'False' # True or False - should it check if there are a certain number of non-empty files and delete the job if yes?
3842
EXPECTED_NUMBER_FILES = 7 # What is the number of files that trigger skipping a job?
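The new dashboard flags, like `CHECK_IF_DONE_BOOL`, are stored as the strings `'True'` and `'False'` rather than as Python booleans. The sketch below shows one way such a flag could be read safely. `flag_enabled` is a hypothetical helper for illustration, not part of Distributed-Something, and how the template itself interprets these values is not shown in this diff.

```python
def flag_enabled(value):
    """Interpret a config flag stored as the string 'True' or 'False'.

    Hypothetical helper for illustration; not part of Distributed-Something.
    """
    return str(value).strip().lower() == "true"


# Values as they appear in config.py in this commit.
CREATE_DASHBOARD = 'True'
CLEAN_DASHBOARD = 'True'
CHECK_IF_DONE_BOOL = 'False'

# bool('False') is True because any non-empty string is truthy,
# so compare the text instead of relying on truthiness.
print(flag_enabled(CREATE_DASHBOARD))    # True
print(flag_enabled(CHECK_IF_DONE_BOOL))  # False
print(bool(CHECK_IF_DONE_BOOL))          # True (the pitfall)
```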

documentation/DS-documentation/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ parts:
2525
- file: step_4_monitor
2626
- caption: Technical guides
2727
chapters:
28+
- file: dashboard
2829
- file: troubleshooting_runs
2930
- file: hygiene
3031
- file: versions

documentation/DS-documentation/customizing_DS.md

Lines changed: 4 additions & 1 deletion
@@ -3,7 +3,8 @@
33
Distributed-Something is a template.
44
It is not fully functional software but is intended to serve as an editable source so that you can quickly and easily implement a distributed workflow for your own Dockerized software.
55

6-
Examples of implementations can be found at [Distributed-CellProfiler](http://github.com/cellprofiler/distributed-cellprofiler), [Distributed-Fiji](http://github.com/cellprofiler/distributed-fiji), and [Distributed-OMEZARRCreator](http://github.com/cellprofiler/distributed-omezarrcreator).
6+
Examples of sophisticated implementations can be found at [Distributed-CellProfiler](http://github.com/DistributedScience/distributed-cellprofiler), [Distributed-Fiji](http://github.com/DistributedScience/distributed-fiji), and [Distributed-OmeZarrCreator](http://github.com/DistributedScience/distributed-omezarrcreator).
7+
We have also created a minimal, fully functional example at [Distributed-HelloWorld](http://github.com/DistributedScience/distributed-helloworld).
78

89
## Customization overview
910

@@ -115,4 +116,6 @@ More configuration information is available in [Step 1: Configuration](step_1_co
115116

116117
#### run.py
117118

119+
You need to customize the Dashboard creation function by changing 'start run' to whatever your run command is.
118120
If you have changed anything in config.py, you will need to edit the section on Task Definitions to match.
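To illustrate where a run command could appear in a dashboard definition, here is a minimal sketch of a CloudWatch log widget whose query filters on that command. The helper name, the query text, and the layout values are assumptions made for this example, not the template's actual run.py code.

```python
import json


def distinct_logs_widget(log_group, run_command, region):
    """Build a CloudWatch dashboard log widget that counts the log streams
    mentioning run_command.

    Hypothetical helper; the query and layout are illustrative assumptions.
    """
    query = (
        f"SOURCE '{log_group}' | fields @logStream "
        f"| filter @message like '{run_command}' "
        "| stats count_distinct(@logStream) by bin(1m)"
    )
    return {
        "type": "log",
        "x": 0, "y": 0, "width": 12, "height": 6,  # arbitrary grid placement
        "properties": {
            "query": query,
            "region": region,
            "title": "Distinct Logs",
            "view": "timeSeries",
        },
    }


# Replace the run command with whatever your own software is started with.
widget = distinct_logs_widget("MyAppName", "cellprofiler -c", "us-east-1")
dashboard_body = json.dumps({"widgets": [widget]})
print(dashboard_body)
```

A body like this could then be passed as `DashboardBody` to the boto3 CloudWatch client's `put_dashboard` call, together with a `DashboardName` of your choosing.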
121+
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
1+
# AWS Cloudwatch Dashboard
2+
![Cloudwatch Dashboard Overview](images/dashboard_overview.png)
3+
4+
AWS Cloudwatch Dashboards are “customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view and create customized views of the metrics and alarms for your AWS resources.”
5+
A Dashboard is full of widgets, each of which you create and customize to report on a separate AWS metric.
6+
Distributed-Something has the option to auto-create a Cloudwatch Dashboard for each run and the option to clean it up when you are done.
7+
These options are set in [your config file](step_1_configuration.md).
8+
9+
The Dashboard setup that DS auto-populates is helpful for monitoring a run as it is occurring or for a post-mortem to better understand a previous run.
10+
Some things you can see include whether your machines are sized appropriately for your jobs, how stable your spot fleet is, and whether your jobs are failing and, if so, whether they’re failing in a consistent manner.
11+
All told, this can help you understand and optimize your resource usage, thus saving you time and money.
12+
13+
## FulfilledCapacity:
14+
![Fulfilled Capacity widget](images/fulfilledcapacity.png)
15+
16+
This widget shows the number of machines in your spot fleet that are fulfilled, i.e. how many machines you actually have at any given point.
17+
After a short spin-up period at the start of a run, you hope to see a straight line at the number of machines requested in your fleet, followed by a steady decrease at the end of the run as the monitor scales your fleet down to match the remaining jobs.
18+
19+
Some number of small dips are all but inevitable as machines crash and are replaced or AWS takes some of your capacity and gives it to a higher bidder.
20+
However, every time there is a dip, it means that a machine that was running a job is no longer running it and any progress on that job is lost.
21+
The job will hang out as “Not Visible” in your SQS queue until it reaches the amount of time set by SQS_MESSAGE_VISIBILITY in [your config file](step_1_configuration.md).
22+
For quick jobs, this doesn’t have much of an impact, but for jobs that take many hours, this can be frustrating and potentially expensive.
23+
24+
If you’re seeing lots of dips or very large dips, you may be able to prevent this in future runs by 1) requesting a different machine type 2) bidding a larger amount for your machines 3) changing regions.
25+
You can also check whether the dips coincide with AWS outages, in which case there’s nothing you can do; it’s just bad luck (that’s what happened with the large dip in the example above).
26+
27+
## NumberOfMessagesReceived/Deleted
28+
29+
![NumberOfMessagesReceived/Deleted](images/messages_deleted_received.png)
30+
31+
This widget shows you in bulk whether your jobs are completing or erroring.
32+
NumberOfMessagesDeleted shows messages deleted from the queue after the job has successfully completed.
33+
NumberOfMessagesReceived shows both messages that are deleted from the queue and messages that are put back in the queue because they errored.
34+
You hope to see that the two lines track on top of each other because that means no messages are erroring.
35+
If there are often gaps between the lines then it means a fraction of your jobs are erroring and you’ll need to figure out why (see MemoryUtilization and Show Errors or look directly in your Cloudwatch Logs for insights).
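As a rough illustration of reading the gap between the two lines, the sketch below estimates the fraction of received messages that were not deleted over some time window. The function name and the sample counts are hypothetical. Messages still being processed at the end of the window also count as received but not yet deleted, so the estimate is only meaningful over windows much longer than a single job.

```python
def undeleted_fraction(received, deleted):
    """Fraction of received messages that were never deleted in the window.

    Hypothetical helper; inputs are the summed NumberOfMessagesReceived and
    NumberOfMessagesDeleted values for the same time window.
    """
    if received == 0:
        return 0.0
    return (received - deleted) / received


# Hypothetical counts: 200 messages picked up, 180 completed and deleted.
print(undeleted_fraction(200, 180))  # 0.1
print(undeleted_fraction(200, 200))  # 0.0, the two lines track each other
```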
36+
37+
## MemoryUtilization
38+
39+
![Memory Utilization](images/memoryutilization.png)
40+
41+
Insufficient memory is the error that we most often encounter (as we try to use the smallest machines possible for economy’s sake) so we like to look at memory usage.
42+
Note that this is showing memory utilization in bulk for your cluster, not for individual machines.
43+
Because different machines reach memory intensive steps at different points in time, and because we’re looking at an average across 5 minute windows, the max percentage you see is likely to be much less than 100%, even if you are using all the memory in your machines at some points.
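A small numeric sketch of this averaging effect, using made-up numbers: three machines each hit 100% memory for one minute of a 5-minute window, but at different minutes, so the cluster-wide average stays well below 100%.

```python
# Hypothetical per-machine memory use (%) sampled once per minute
# over a single 5-minute window. Each machine peaks at a different minute.
machine_a = [30, 100, 30, 30, 30]
machine_b = [30, 30, 30, 100, 30]
machine_c = [30, 30, 30, 30, 100]

samples = machine_a + machine_b + machine_c

# Averaging over all machines and all samples hides the individual peaks.
cluster_average = sum(samples) / len(samples)
peak = max(samples)

print(cluster_average)  # 44.0
print(peak)             # 100
```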
44+
45+
## MessagesVisible/NotVisible
46+
47+
![MessagesVisible/NotVisible](images/messages_change_slope.png)
48+
49+
Visible messages are messages waiting in your queue.
50+
Hidden messages (aka MessagesNotVisible) have been started and will remain hidden until either they are completed and therefore removed from the queue or they reach the time set in SQS_MESSAGE_VISIBILITY in your config file, whichever comes first.
51+
([Read more about Message Visibility](SQS_QUEUE_information.md).)
52+
After starting your fleet (and waiting long enough for at least one round of jobs to complete), you hope to see a linear decline in total messages with the number of hidden messages equal to the number of jobs being run (fleet size * tasks per machine * docker cores).
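The expected number of hidden messages is a simple product. A minimal sketch, with parameter names chosen to mirror the wording above rather than any particular config variable:

```python
def expected_hidden_messages(fleet_size, tasks_per_machine, docker_cores):
    """Number of jobs running at once, i.e. the expected count of hidden messages."""
    return fleet_size * tasks_per_machine * docker_cores


# Hypothetical fleet: 10 machines, 2 tasks per machine, 4 docker cores per task.
print(expected_hidden_messages(10, 2, 4))  # 80
```

If the hidden-message count in the widget sits well below this number while visible messages remain, some of the fleet is probably not pulling jobs.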
53+
54+
![Blip in MessagesVisible/NotVisible](images/blip_in_messagesnotvisible.png)
55+
56+
Sometimes you’ll see a blip where there is a rapid increase in the number of hidden messages (as pictured above).
57+
This can happen if there is an error on a machine and the hard disk gets full: the machine rapidly pulls jobs and puts them back until the error is caught and the machine is rebooted.
58+
This type of error shows in this widget as it happens.
59+
60+
If your spot fleet loses capacity (see FulfilledCapacity), you may see a blip in MessagesVisible/NotVisible where the number of hidden messages rapidly decreases.
61+
This blip appears in the widget up to the amount of time set in SQS_MESSAGE_VISIBILITY in your config file after the capacity loss, when jobs that were started (i.e. hidden) but not completed return to visible status.
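The timing can be sketched as follows. SQS starts a message's visibility timeout when the message is received, so an unfinished job returns to visible status once that timeout expires. The timestamps are hypothetical, and the sketch assumes nothing extends the timeout while the job runs.

```python
from datetime import datetime, timedelta

SQS_MESSAGE_VISIBILITY = 1 * 60  # seconds, as in the example config.py


def visible_again_at(received_at, visibility_seconds=SQS_MESSAGE_VISIBILITY):
    """Time at which an unfinished job's message becomes visible again.

    Assumes the visibility timeout is never extended during the job.
    """
    return received_at + timedelta(seconds=visibility_seconds)


# Hypothetical timeline: the job starts, and the machine is lost 20 s later.
received = datetime(2022, 7, 12, 14, 0, 0)
machine_lost = datetime(2022, 7, 12, 14, 0, 20)

back = visible_again_at(received)
print(back)                                   # 2022-07-12 14:01:00
print((back - machine_lost).total_seconds())  # 40.0
```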
62+
63+
The relative slope of your graph can also be informative.
64+
For the run pictured at top, we discovered that a fraction of our jobs were erroring because the machines were running out of memory.
65+
Midway through 7/12 we upped the memory of the machines in our fleet, and from that point on you can see a steeper slope as more jobs finished in the same amount of time (because fewer were failing with memory errors).
66+
67+
## Distinct Logs
68+
69+
![Logs comparison](images/logs_comparison.png)
70+
71+
This widget shows you the number of distinct jobs that start within your given time window by plotting the number of Cloudwatch logs that have your run command in them.
72+
In this example, our run command is "cellprofiler -c".
73+
It is not necessarily informative on its own, but very helpful when compared with the following widget.
74+
75+
## All logs
76+
This widget shows you the total number of times that jobs are started within your log group within the given time window.
77+
Ideally, you want this number to match the number in the previous widget as it means that each job is starting in your software only once.
78+
79+
If this number is consistently larger than the previous widget’s number, it could mean that some of your jobs are erroring and you’ll need to figure out why (see MemoryUtilization and Show Errors or look directly in your Cloudwatch Logs for insights).
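The comparison between the two widgets can be mimicked locally on a handful of log entries. In this sketch the log stream names and messages are invented, and "cellprofiler -c" stands in for your own run command.

```python
from collections import Counter

run_command = "cellprofiler -c"

# Hypothetical (log stream, message) pairs from one time window.
log_entries = [
    ("job_A", "cellprofiler -c -r -p pipeline.cppipe"),
    ("job_B", "cellprofiler -c -r -p pipeline.cppipe"),
    ("job_B", "cellprofiler -c -r -p pipeline.cppipe"),  # job_B started twice
    ("job_C", "cellprofiler -c -r -p pipeline.cppipe"),
    ("job_C", "Uploading results"),                      # not a job start
]

# Keep only the entries that record a job start.
starts = [stream for stream, message in log_entries if run_command in message]

total_starts = len(starts)        # what "All logs" counts
distinct_jobs = len(set(starts))  # what "Distinct Logs" counts
restarted = sorted(s for s, n in Counter(starts).items() if n > 1)

print(total_starts)   # 4
print(distinct_jobs)  # 3
print(restarted)      # ['job_B']
```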
80+
81+
## Show Errors
82+
![Show errors](images/expand_error_log.png)
83+
84+
This widget shows you the log entry any time that it contains “Error”.
85+
Ideally, this widget will remain empty.
86+
If it is logging errors, you can toggle each row for more information - it will show the job that errored in @logStream and the actual error message in @message.
87+
88+
## Interacting with a Dashboard:
89+
90+
Once you have your Dashboard created and full of widgets, you can adjust the timescale for which the widget is reporting metrics.
91+
For any of the widgets you can set the absolute or relative time that the widget is showing by selecting the time scale from the upper right corner of the screen.
92+
Zoom in to a particular time selection on a visible widget by drawing a box around that time on the widget itself (note that zooming in doesn’t change what’s plotted, just what part of the plot you can see so metrics like Show Errors won’t update with a zoom).
93+
94+
Some widgets allow you to select/deselect certain metrics plotted in the widget.
95+
To hide a metric without permanently removing it from the widget, simply click the X on the box next to the name of the metric in the legend.
96+
97+
You can move the widgets around on your dashboard by hovering on the upper right or upper left corner of a widget until a 4-direction-arrow icon appears and then dragging and dropping the widget.
98+
You can change the size of a widget by hovering on the lower right corner of the widget until a diagonal arrow icon appears and then dragging the widget to the desired size.
99+
After making changes, make sure to select Save dashboard from the top menu so that they are maintained after refreshing the page.