Commit fb87fde

Merge branch 'main' into manuscript_improvements
2 parents 16a76db + 6580a81

10 files changed: +202 −75 lines changed

README.md
Lines changed: 2 additions & 2 deletions

@@ -66,6 +66,6 @@ When the cluster is up and running, you can monitor progress using the following
  The file APP_NAMESpotFleetRequestId.json is created after the cluster is setup in step 3.
  It is important to keep this monitor running if you want to automatically shutdown computing resources when there are no more tasks in the queue (recommended).

- See the wiki for more information about each step of the process.
+ See our [full documentation](https://distributedscience.github.io/Distributed-Something) for more information about each step of the process.

- ![Distributed-Something](https://user-images.githubusercontent.com/6721515/148241641-7e447d94-dc25-4214-afb1-132e3dc06987.png)
+ ![Distributed-Something](documentation/DS-documentation/images/Distributed-Something_chronological_overview.png)

documentation/DS-documentation/_toc.yml
Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,8 @@ parts:
  - caption: FAQ
    chapters:
    - file: overview
+   - file: overview_2
+   - file: costs
  - caption: Adapting Distributed-Something to a new application
    chapters:
    - file: customizing_DS
@@ -25,4 +27,5 @@ parts:
    chapters:
    - file: dashboard
    - file: troubleshooting_runs
+   - file: hygiene
    - file: versions
documentation/DS-documentation/costs.md (new file)
Lines changed: 27 additions & 0 deletions

# What does Distributed-Something cost?

Distributed-Something is run by a series of three commands, only one of which incurs costs:

[`setup`](step_1_configuration.md) creates a queue in SQS and a cluster, service, and task definition in ECS.
ECS is entirely free.
SQS queues are free to create and free to use for up to 1 million requests/month.

[`submitJobs`](step_2_submit_jobs.md) places messages in the SQS queue, which is free (under 1 million requests/month).

[`startCluster`](step_3_start_cluster.md) is the only command that incurs costs.
It initiates your spot fleet request, the major cost of running Distributed-Something; exact pricing depends on the number of machines, the type of machines, and the duration of use.
Your bid is configured in the [config file](step_1_configuration.md).

Spot fleet costs can be minimized or stopped in multiple ways:
1) We encourage the use of [`monitor`](step_4_monitor.md) during your job: it automatically scales down your spot fleet request as your job queue empties and cancels your spot fleet request when there are no more jobs in the queue.
2) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to perform the same cleanup (without the automatic scaling).
3) If you want to abort and clean up a run, you can purge your SQS queue in the [AWS SQS console](https://console.aws.amazon.com/sqs/) (by selecting your queue and pressing Actions => Purge) and then initiate [`monitor`](step_4_monitor.md) to perform the same cleanup.
4) You can stop the spot fleet request directly in the [AWS EC2 console](https://console.aws.amazon.com/ec2/) by going to Instances => Spot Requests, selecting your spot request, and pressing Actions => Cancel Spot Request.

After the spot fleet has started, a CloudWatch instance alarm is automatically placed on each instance in the fleet.
CloudWatch instance alarms currently cost $0.10/alarm/month.
CloudWatch instance alarm costs can be minimized or stopped in multiple ways:
1) If you run [`monitor`](step_4_monitor.md) during your job, it will automatically delete CloudWatch alarms for any instance that is no longer in use, once an hour while running and at the end of a run.
2) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to delete CloudWatch alarms for any instance that is no longer in use.
3) In the [AWS CloudWatch console](https://console.aws.amazon.com/cloudwatch/) you can select unused alarms by going to Alarms => All alarms, changing Any State to Insufficient Data, selecting all alarms, and then pressing Actions => Delete.
4) We provide a [hygiene script](hygiene.md) that will clean up old alarms for you.

documentation/DS-documentation/customizing_DS.md
Lines changed: 31 additions & 6 deletions

@@ -1,25 +1,50 @@
- (customization)=
  # Customizing DS

  Distributed-Something is a template.
  It is not fully functional software but is intended to serve as an editable source so that you can quickly and easily implement a distributed workflow for your own Dockerized software.

- Examples of sophisticated implementations can be found at [Distributed-CellProfiler](http://github.com/DistributedScience/distributed-cellprofiler), [Distributed-Fiji](http://github.com/DistributedScience/distributed-fiji), and [Distributed-OmeZarrMaker](http://github.com/DistributedScience/distributed-omezarrmaker).
+ Examples of sophisticated implementations can be found at [Distributed-CellProfiler](http://github.com/DistributedScience/distributed-cellprofiler), [Distributed-Fiji](http://github.com/DistributedScience/distributed-fiji), and [Distributed-OmeZarrCreator](http://github.com/DistributedScience/distributed-omezarrcreator).
  We have also created a minimal, fully functional example at [Distributed-HelloWorld](http://github.com/DistributedScience/distributed-helloworld).

  ## Customization overview

  Before starting to customize Distributed-Something code, do some research on your desired implementation.

- 1) Ask how splittable is the function you want to distribute?
- If the end product you envision cannot easily be split into small tasks then it may not be a good fit for Distributed-Something.
- 2) Make or find a Docker of the software you want to distribute.
+ 1) **Ask how splittable the function you want to distribute is.**
+ Distributed-Something only works on "perfectly parallel" tasks, i.e. tasks that do not communicate with each other while running.
+ If the end product you envision cannot easily be split into perfectly parallel tasks, then it may not be a good fit for Distributed-Something.
+
+ Scale has a large impact on how splittable your function is.
+ For example, if you want to stitch together a set of images into one larger image, the set you are stitching is the smallest unit you can make your job; because jobs must be "perfectly parallel", you cannot distribute the images any further.
+ If you generally work with datasets that only require a few stitching jobs, Distributed-Something may not be a good fit for your use case.
+ However, if you often work with very large datasets where you need to stitch many sets of images, distributing stitching tasks with Distributed-Something may still provide significant savings in time and compute cost, even though you cannot further parallelize each job.
+
+ 2) **Make or find a Docker of the software you want to distribute.**
  You can find over 1000 scientific software packages already Dockerized at [Biocontainers](http://biocontainers.pro), and many open-source packages provide Docker files within their GitHub repositories.
  See [Implementing Distributed-Something](implementing_DS.md) for more details.
- 3) Figure out how to make your software run from the command line.
+
+ 3) **Figure out how to make your software run from the command line.**
  What parameters do you need to pass to it?
+ Are there optional program parameters that you want to require in your Distributed-Something implementation?
  What is generic to how you like to run the application and what is different for each job?

+ 4) **Think about how you will set up/access your data so that it is batchable/parallelizable.**
+ Because Distributed-Something is so application specific, there are many approaches one can take to parse a dataset into batches that can be parallelized.
+ Implemented examples you can reference are:
+ - In [Distributed-CellProfiler](https://github.com/DistributedScience/Distributed-CellProfiler), we use LoadData.csvs to pass CellProfiler the exact list of files, with their S3 file paths, that we want it to access/download for processing.
+ - In [Distributed-Fiji](https://github.com/DistributedScience/Distributed-Fiji), we tell it what folder to access and pass upload and download filters for it to select specific files within that folder.
+ - In [Distributed-OmeZarrCreator](https://github.com/DistributedScience/Distributed-OmeZarrCreator), the job unit is always the same (one plate of images), so less flexibility is required and the S3 path and plate name passed in the job file are sufficient.
+
+ ## Using the Distributed-Something template
+
+ Distributed-Something is a template repository.
+ Read more about [GitHub template repositories](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template) and follow the instructions to create your own project repository from the template.
+ We have chosen to provide DS as a template because it gives new implementations a clean commit history.
+ Because DS is so customizable, we expect that implementations will diverge from the template.
+ Unlike forks, for which GitHub currently provides a "sync fork" function, templates have no automatic way of pulling changes from the template into repositories made from them.
+ If you anticipate wanting to keep your implementation more closely linked to the DS template, you can instead fork the template to create your project repository and use the "sync fork" function as necessary.
+ Or, use the six lines of [code described here](https://stackoverflow.com/questions/56577184/github-pull-changes-from-a-template-repository/69563752#69563752) to pull template changes into your repository.

  ## Customization details

  There are many points at which you will need to customize Distributed-Something for your own implementation; these customization points are summarized below.
documentation/DS-documentation/hygiene.md (new file)
Lines changed: 62 additions & 0 deletions

# AWS Hygiene Scripts

See also [AuSPICES](https://github.com/broadinstitute/AuSPICES) for setting up various hygiene scripts to automatically run in your AWS account.

## Clean out old alarms

Python:

```python
import boto3
import time

filterstring = 'MyProjectName'

client = boto3.client('cloudwatch')
alarms = client.describe_alarms(AlarmTypes=['MetricAlarm'], StateValue='INSUFFICIENT_DATA')
while True:
    for eachalarm in alarms['MetricAlarms']:
        if filterstring in eachalarm['AlarmName']:
            client.delete_alarms(AlarmNames=[eachalarm['AlarmName']])
            time.sleep(1)  # avoid API throttling
    if 'NextToken' not in alarms:  # no more pages of alarms to fetch
        break
    alarms = client.describe_alarms(AlarmTypes=['MetricAlarm'],
                                    StateValue='INSUFFICIENT_DATA',
                                    NextToken=alarms['NextToken'])
```

## Clean out old log groups

Bash (requires `csvkit`):

```sh
aws logs describe-log-groups | in2csv -f json --key logGroups > logs.csv
```

R (requires `dplyr` and `readr`):

```r
library(dplyr)
library(readr)
read_csv(
  "logs.csv",
  col_types = cols_only(
    storedBytes = col_integer(),
    creationTime = col_double(),
    logGroupName = col_character()
  )
) %>%
  mutate(creationTime =
           as.POSIXct(creationTime / 1000,
                      origin = "1970-01-01")) %>%
  filter(storedBytes == 0) %>%
  select(logGroupName) %>%
  write_tsv("logs_clear.txt", col_names = FALSE)
```

Bash (requires GNU `parallel`):

```sh
parallel aws logs delete-log-group --log-group-name {1} :::: logs_clear.txt
```
(binary image file, 942 KB; preview not shown)

documentation/DS-documentation/overview.md
Lines changed: 3 additions & 65 deletions

@@ -6,7 +6,8 @@ Distributed-Something:
  * simplifies the process of distributing and running software in the cloud.
  * decreases the cost of cloud computing by optimizing resources used.
  * makes workflows reproducible.
- * is Python based which makes it broadly accessible to novice computationalists.
+ * is Python based, which makes creating new implementations broadly accessible to intermediate computationalists.
+ * only requires human-readable configuration to run an implementation, which makes it broadly accessible to novice computationalists.

  You will need to customize Distributed-Something for your particular use case.
  See [Customizing Distributed-Something](customizing_DS.md) for customization details.
@@ -24,8 +25,7 @@ Dockerizing a workflow has many benefits including

  Using AWS allows you to create a flexible, on-demand computing infrastructure where you only have to pay for the resources you use.
  This can give you access to far more computing power than you may have available at your home institution, which is great when you have large datasets to process.
-
- Each piece of the infrastructure has to be added and configured separately, which can be time-consuming and confusing.
+ However, typically each piece of the infrastructure has to be added and configured separately, which can be time-consuming and confusing.

  Distributed-Something tries to leverage the power of the former, while minimizing the problems of the latter.

@@ -34,68 +34,6 @@ Distributed-Something tries to leverage the power of the former, while minimizin
  Essentially all you need to run Distributed-Something is an AWS account and a terminal program; see our [page on getting set up](step_0_prep.md) for all the specific steps you'll need to take.
  You will also need a Dockerized version of your software.

- ## What happens in AWS when I run Distributed-Something?
-
- The steps for actually running the Distributed-Something code are outlined in the repository [README](https://github.com/DistributedScience/Distributed-Something/blob/master/README.md), and details of the parameters you set in each step are on their respective documentation pages ([Step 1: Config](step_1_configuration.md), [Step 2: Jobs](step_2_submit_jobs.md), [Step 3: Fleet](step_3_start_cluster.md), and optional [Step 4: Monitor](step_4_monitor.md)).
- We'll give an overview of what happens in AWS at each step here and explain what AWS does automatically once you have it set up.
-
- **Step 1**:
- In the Config file you set quite a number of specifics that are used by EC2, ECS, SQS, and in making Dockers.
- When you run `$ python3 run.py setup` to execute the Config, it does three major things:
- * Creates task definitions.
- These are found in ECS.
- They define the configuration of the Dockers and include the settings you gave for **CHECK_IF_DONE_BOOL**, **DOCKER_CORES**, **EXPECTED_NUMBER_FILES**, and **MEMORY**.
- * Makes a queue in SQS (it is empty at this point) and sets a dead-letter queue.
- * Makes a service in ECS which defines how many Dockers you want.
-
- **Step 2**:
- In the Job file you set the location of any inputs (e.g. data and batch-specific scripts) and outputs.
- Additionally, you list all of the individual tasks that you want run.
- When you submit the Job file it adds that list of tasks to the queue in SQS (which you made in the previous step).
- Submit jobs with `$ python3 run.py submitJob`.
-
- **Step 3**:
- In the Config file you set the number and size of the EC2 instances you want.
- This information, along with account-specific configuration in the Fleet file, is used to start the fleet with `$ python3 run.py startCluster`.
-
- **After these steps are complete, a number of things happen automatically**:
- * ECS puts Docker containers onto EC2 instances.
- If there is a mismatch within your Config file and the Docker is larger than the instance, it will not be placed.
- ECS will keep placing Dockers onto an instance until it is full, so if you accidentally create instances that are too large you may end up with more Dockers placed on them than intended.
- This is also why you may want multiple **ECS_CLUSTER**s, so that ECS doesn't blindly place Dockers you intended for one job onto an instance you intended for another job.
- * When a Docker container gets placed, it gives the instance it's on its own name.
- * Once an instance has a name, the Docker gives it an alarm that tells it to reboot if it is sitting idle for 15 minutes.
- * The Docker hooks the instance up to the _perinstance logs in CloudWatch.
- * The instances look in SQS for a job.
- Any time they don't have a job they go back to SQS.
- If SQS tells them there are no visible jobs then they shut themselves down.
- * When an instance finishes a job it sends a message to SQS and removes that job from the queue.
-
- ## What does this look like?
-
- ![Example Instance Configuration](images/sample_DCP_config_1.png)
-
- This is one possible instance configuration, using [Distributed-CellProfiler](http://github.com/cellprofiler/distributed-cellprofiler) as an example.
- This is one m4.16xlarge EC2 instance (64 CPUs, 250GB of RAM) with a 165GB EBS volume mounted on it. A spot fleet could contain many such instances.
- It has 16 tasks (individual Docker containers).
- Each Docker container uses 10GB of hard disk space and is assigned 4 CPUs and 15GB of RAM (which it does not share with other Docker containers).
- Each container shares its individual resources among 4 copies of CellProfiler.
- Each copy of CellProfiler runs a pipeline on one "job", which can be anything from a single image to an entire 384-well plate or timelapse movie.
- You can optionally stagger the start time of these 4 copies of CellProfiler, ensuring that the most memory- or disk-intensive steps aren't happening simultaneously, decreasing the likelihood of a crash.
-
- Read more about this and other configurations in [Step 1: Configuration](step_1_configuration.md).
-
- ## How do I determine my configuration?
-
- To some degree, you determine the best configuration for your needs through trial and error.
- * Looking at the resources your software uses on your local computer when it runs your jobs can give you a sense of roughly how much hard drive and memory space each job requires, which can help you determine your group size and what machines to use.
- * Prices of different machine sizes fluctuate, so the choice of which type of machines to use in your spot fleet is best determined at the time you run it.
- How long a job takes to run and how quickly you need the data may also affect how much you're willing to bid for any given machine.
- * Running a few large Docker containers (as opposed to many small ones) increases the amount of memory all the copies of your software are sharing, decreasing the likelihood you'll run out of memory if you stagger your job start times.
- However, you're also at a greater risk of running out of hard disk space.
-
- Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking of your configuration.
-
  ## Can I contribute code to Distributed-Something?

  Feel free! We're always looking for ways to improve.
