Commit fb87fde

Merge branch 'main' into manuscript_improvements
2 parents 16a76db + 6580a81

10 files changed: +202 −75 lines changed

README.md
Lines changed: 2 additions & 2 deletions

@@ -66,6 +66,6 @@ When the cluster is up and running, you can monitor progress using the following
  The file APP_NAMESpotFleetRequestId.json is created after the cluster is setup in step 3.
  It is important to keep this monitor running if you want to automatically shutdown computing resources when there are no more tasks in the queue (recommended).

- See the wiki for more information about each step of the process.
+ See our [full documentation](https://distributedscience.github.io/Distributed-Something) for more information about each step of the process.

- ![Distributed-Something](https://user-images.githubusercontent.com/6721515/148241641-7e447d94-dc25-4214-afb1-132e3dc06987.png)
+ ![Distributed-Something](documentation/DS-documentation/images/Distributed-Something_chronological_overview.png)

documentation/DS-documentation/_toc.yml
Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,8 @@ parts:
  - caption: FAQ
    chapters:
    - file: overview
+   - file: overview_2
+   - file: costs
  - caption: Adapting Distributed-Something to a new application
    chapters:
    - file: customizing_DS
@@ -25,4 +27,5 @@ parts:
    chapters:
    - file: dashboard
    - file: troubleshooting_runs
+   - file: hygiene
    - file: versions
documentation/DS-documentation/costs.md (new file)
Lines changed: 27 additions & 0 deletions

# What does Distributed-Something cost?

Distributed-Something is run by a series of three commands, only one of which incurs costs:

[`setup`](step_1_configuration.md) creates a queue in SQS and a cluster, service, and task definition in ECS.
ECS is entirely free.
SQS queues are free to create and free to use for up to 1 million requests/month.

[`submitJobs`](step_2_submit_jobs.md) places messages in the SQS queue, which is free (under 1 million requests/month).

[`startCluster`](step_3_start_cluster.md) is the only command that incurs costs.
It initiates your spot fleet request, the major cost of running Distributed-Something; exact pricing depends on the number of machines, the type of machines, and the duration of use.
Your bid is configured in the [config file](step_1_configuration.md).

Spot fleet costs can be minimized or stopped in multiple ways:
1) We encourage the use of [`monitor`](step_4_monitor.md) during your job: it automatically scales down your spot fleet request as your job queue empties and cancels your spot fleet request when there are no more jobs in the queue.
2) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to perform the same cleanup (without the automatic scaling).
3) If you want to abort and clean up a run, you can purge your SQS queue in the [AWS SQS console](https://console.aws.amazon.com/sqs/) (by selecting your queue and pressing Actions => Purge) and then initiate [`monitor`](step_4_monitor.md) to perform the same cleanup.
4) You can stop the spot fleet request directly in the [AWS EC2 console](https://console.aws.amazon.com/ec2/) by going to Instances => Spot Requests, selecting your spot request, and pressing Actions => Cancel Spot Request.

After the spot fleet has started, a CloudWatch instance alarm is automatically placed on each instance in the fleet.
CloudWatch instance alarms currently cost $0.10/alarm/month.
CloudWatch instance alarm costs can be minimized or stopped in multiple ways:
1) If you run [`monitor`](step_4_monitor.md) during your job, it will automatically delete CloudWatch alarms for any instance that is no longer in use, once an hour while running and at the end of a run.
2) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to delete CloudWatch alarms for any instance that is no longer in use.
3) In the [AWS CloudWatch console](https://console.aws.amazon.com/cloudwatch/) you can select unused alarms by going to Alarms => All alarms, changing Any State to Insufficient Data, selecting all alarms, and then pressing Actions => Delete.
4) We provide a [hygiene script](hygiene.md) that will clean up old alarms for you.

documentation/DS-documentation/customizing_DS.md
Lines changed: 31 additions & 6 deletions

@@ -1,25 +1,50 @@
- (customization)=
  # Customizing DS

  Distributed-Something is a template.
  It is not fully functional software but is intended to serve as an editable source so that you can quickly and easily implement a distributed workflow for your own Dockerized software.

- Examples of sophisticated implementations can be found at [Distributed-CellProfiler](http://github.com/DistributedScience/distributed-cellprofiler), [Distributed-Fiji](http://github.com/DistributedScience/distributed-fiji), and [Distributed-OmeZarrMaker](http://github.com/DistributedScience/distributed-omezarrmaker).
+ Examples of sophisticated implementations can be found at [Distributed-CellProfiler](http://github.com/DistributedScience/distributed-cellprofiler), [Distributed-Fiji](http://github.com/DistributedScience/distributed-fiji), and [Distributed-OmeZarrCreator](http://github.com/DistributedScience/distributed-omezarrcreator).
  We have also created a minimal, fully functional example at [Distributed-HelloWorld](http://github.com/DistributedScience/distributed-helloworld).

  ## Customization overview

  Before starting to customize Distributed-Something code, do some research on your desired implementation.

- 1) Ask how splittable is the function you want to distribute?
- If the end product you envision cannot easily be split into small tasks then it may not be a good fit for Distributed-Something.
- 2) Make or find a Docker of the software you want to distribute.
+ 1) **Ask how splittable the function you want to distribute is.**
+ Distributed-Something only works on "perfectly parallel" tasks, i.e. tasks that do not communicate with each other while running.
+ If the end product you envision cannot easily be split into perfectly parallel tasks, then it may not be a good fit for Distributed-Something.
+
+ Scale has a large impact on how splittable your function is.
+ For example, if you want to stitch together a set of images into one larger image, the set you are stitching is the smallest unit you can make your job; because jobs must be "perfectly parallel", you cannot distribute the images any further.
+ If you generally work with datasets that only require a few stitching jobs, Distributed-Something may not be a good fit for your use case.
+ However, if you often work with very large datasets where you need to stitch many sets of images, distributing stitching tasks with Distributed-Something may still provide significant savings in time and compute cost, even though you cannot further parallelize each job.
+
+ 2) **Make or find a Docker of the software you want to distribute.**
  You can find over 1000 scientific software packages already Dockerized at [Biocontainers](http://biocontainers.pro), and many open-source packages provide Docker files within their GitHub repositories.
  See [Implementing Distributed-Something](implementing_DS.md) for more details.
- 3) Figure out how to make your software run from the command line.
+
+ 3) **Figure out how to make your software run from the command line.**
  What parameters do you need to pass to it?
+ Are there optional program parameters that you want to require in your Distributed-Something implementation?
  What is generic to how you like to run the application and what is different for each job?

+ 4) **Think about how you will set up/access your data so that it is batchable/parallelizable.**
+ Because Distributed-Something is so application specific, there are many approaches one can take to parse a dataset into batches that can be parallelized.
+ Implemented examples you can reference are:
+ - In [Distributed-CellProfiler](https://github.com/DistributedScience/Distributed-CellProfiler), we use LoadData.csvs to pass CellProfiler the exact list of files, with their S3 file paths, that we want it to access/download for processing.
+ - In [Distributed-Fiji](https://github.com/DistributedScience/Distributed-Fiji), we tell it what folder to access and pass upload and download filters for it to select specific files within that folder.
+ - In [Distributed-OmeZarrCreator](https://github.com/DistributedScience/Distributed-OmeZarrCreator), the job unit is always the same (one plate of images), so less flexibility is required and the S3 path and plate name passed in the job file are sufficient.
+
+ ## Using the Distributed-Something template
+
+ Distributed-Something is a template repository.
+ Read more about [GitHub template repositories](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template) and follow the instructions to create your own project repository from the template.
+ We have chosen to provide DS as a template because it gives new implementations a clean commit history.
+ Because DS is so customizable, we expect that implementations will diverge from the template.
+ Unlike forks, for which GitHub currently provides a "sync fork" function, templates have no automatic way of pulling changes from the template into repositories made from them.
+ If you anticipate wanting to keep your implementation more closely linked to the DS template, you can instead fork the template to create your project repository and use the "sync fork" function as necessary.
+ Or, use the six lines of [code described here](https://stackoverflow.com/questions/56577184/github-pull-changes-from-a-template-repository/69563752#69563752) to pull template changes into your repository.

  ## Customization details

  There are many points at which you will need to customize Distributed-Something for your own implementation; these customization points are summarized below.
documentation/DS-documentation/hygiene.md (new file)
Lines changed: 62 additions & 0 deletions

# AWS Hygiene Scripts

See also [AuSPICES](https://github.com/broadinstitute/AuSPICES) for setting up various hygiene scripts to automatically run in your AWS account.

## Clean out old alarms

Python:

```python
import boto3
import time

filterstring = 'MyProjectName'

client = boto3.client('cloudwatch')
alarms = client.describe_alarms(AlarmTypes=['MetricAlarm'], StateValue='INSUFFICIENT_DATA')
while True:
    for eachalarm in alarms['MetricAlarms']:
        if filterstring in eachalarm['AlarmName']:
            client.delete_alarms(AlarmNames=[eachalarm['AlarmName']])
            time.sleep(1)  # avoid API throttling
    if 'NextToken' not in alarms:  # no more pages of alarms to fetch
        break
    alarms = client.describe_alarms(AlarmTypes=['MetricAlarm'],
                                    StateValue='INSUFFICIENT_DATA',
                                    NextToken=alarms['NextToken'])
```

## Clean out old log groups

Bash (requires `csvkit`):

```sh
aws logs describe-log-groups | in2csv -f json --key logGroups > logs.csv
```

R (requires `dplyr` and `readr`):

```r
library(dplyr)
library(readr)
read_csv(
  "logs.csv",
  col_types = cols_only(
    storedBytes = col_integer(),
    creationTime = col_double(),
    logGroupName = col_character()
  )
) %>%
  mutate(creationTime =
           as.POSIXct(creationTime / 1000,
                      origin = "1970-01-01")) %>%
  filter(storedBytes == 0) %>%
  select(logGroupName) %>%
  write_tsv("logs_clear.txt", col_names = FALSE)
```

Bash (requires GNU `parallel`):

```sh
parallel aws logs delete-log-group --log-group-name {1} :::: logs_clear.txt
```
(binary image file, 942 KB; preview not shown)

documentation/DS-documentation/overview.md
Lines changed: 3 additions & 65 deletions

@@ -6,7 +6,8 @@ Distributed-Something:
  * simplifies the process of distributing and running software in the cloud.
  * decreases the cost of cloud computing by optimizing resources used.
  * makes workflows reproducible.
- * is Python based which makes it broadly accessible to novice computationalists.
+ * is Python based, which makes creating new implementations broadly accessible to intermediate computationalists.
+ * only requires human-readable configuration to run an implementation, which makes it broadly accessible to novice computationalists.

  You will need to customize Distributed-Something for your particular use case.
  See [Customizing Distributed-Something](customizing_DS.md) for customization details.
@@ -24,8 +25,7 @@ Dockerizing a workflow has many benefits including

  Using AWS allows you to create a flexible, on-demand computing infrastructure where you only have to pay for the resources you use.
  This can give you access to far more computing power than you may have available at your home institution, which is great when you have large datasets to process.
-
- Each piece of the infrastructure has to be added and configured separately, which can be time-consuming and confusing.
+ However, typically each piece of the infrastructure has to be added and configured separately, which can be time-consuming and confusing.

  Distributed-Something tries to leverage the power of the former, while minimizing the problems of the latter.

@@ -34,68 +34,6 @@ Distributed-Something tries to leverage the power of the former, while minimizin
  Essentially all you need to run Distributed-Something is an AWS account and a terminal program; see our [page on getting set up](step_0_prep.md) for all the specific steps you'll need to take.
  You will also need a Dockerized version of your software.

- ## What happens in AWS when I run Distributed-Something?
-
- The steps for actually running the Distributed-Something code are outlined in the repository [README](https://github.com/DistributedScience/Distributed-Something/blob/master/README.md), and details of the parameters you set in each step are on their respective documentation pages ([Step 1: Config](step_1_configuration.md), [Step 2: Jobs](step_2_submit_jobs.md), [Step 3: Fleet](step_3_start_cluster.md), and optional [Step 4: Monitor](step_4_monitor.md)).
- We'll give an overview of what happens in AWS at each step here and explain what AWS does automatically once you have it set up.
-
- **Step 1**:
- In the Config file you set quite a number of specifics that are used by EC2, ECS, SQS, and in making Dockers.
- When you run `$ python3 run.py setup` to execute the Config, it does three major things:
- * Creates task definitions.
- These are found in ECS.
- They define the configuration of the Dockers and include the settings you gave for **CHECK_IF_DONE_BOOL**, **DOCKER_CORES**, **EXPECTED_NUMBER_FILES**, and **MEMORY**.
- * Makes a queue in SQS (it is empty at this point) and sets a dead-letter queue.
- * Makes a service in ECS which defines how many Dockers you want.
-
- **Step 2**:
- In the Job file you set the location of any inputs (e.g. data and batch-specific scripts) and outputs.
- Additionally, you list all of the individual tasks that you want run.
- When you submit the Job file it adds that list of tasks to the queue in SQS (which you made in the previous step).
- Submit jobs with `$ python3 run.py submitJob`.
-
- **Step 3**:
- In the Config file you set the number and size of the EC2 instances you want.
- This information, along with account-specific configuration in the Fleet file, is used to start the fleet with `$ python3 run.py startCluster`.
-
- **After these steps are complete, a number of things happen automatically**:
- * ECS puts Docker containers onto EC2 instances.
- If there is a mismatch within your Config file and the Docker is larger than the instance, it will not be placed.
- ECS will keep placing Dockers onto an instance until it is full, so if you accidentally create instances that are too large you may end up with more Dockers placed on them than intended.
- This is also why you may want multiple **ECS_CLUSTER**s, so that ECS doesn't blindly place Dockers you intended for one job onto an instance you intended for another job.
- * When a Docker container gets placed, it gives the instance it's on its own name.
- * Once an instance has a name, the Docker gives it an alarm that tells it to reboot if it is sitting idle for 15 minutes.
- * The Docker hooks the instance up to the _perinstance logs in CloudWatch.
- * The instances look in SQS for a job.
- Any time they don't have a job they go back to SQS.
- If SQS tells them there are no visible jobs then they shut themselves down.
- * When an instance finishes a job it sends a message to SQS and removes that job from the queue.
-
- ## What does this look like?
-
- ![Example Instance Configuration](images/sample_DCP_config_1.png)
-
- This is one possible instance configuration, using [Distributed-CellProfiler](http://github.com/cellprofiler/distributed-cellprofiler) as an example.
- This is one m4.16xlarge EC2 instance (64 CPUs, 250GB of RAM) with a 165GB EBS volume mounted on it. A spot fleet could contain many such instances.
- It has 16 tasks (individual Docker containers).
- Each Docker container uses 10GB of hard disk space and is assigned 4 CPUs and 15GB of RAM (which it does not share with other Docker containers).
- Each container shares its individual resources among 4 copies of CellProfiler.
- Each copy of CellProfiler runs a pipeline on one "job", which can be anything from a single image to an entire 384-well plate or timelapse movie.
- You can optionally stagger the start time of these 4 copies of CellProfiler, ensuring that the most memory- or disk-intensive steps aren't happening simultaneously, decreasing the likelihood of a crash.
-
- Read more about this and other configurations in [Step 1: Configuration](step_1_configuration.md).
-
- ## How do I determine my configuration?
-
- To some degree, you determine the best configuration for your needs through trial and error.
- * Looking at the resources your software uses on your local computer when it runs your jobs can give you a sense of roughly how much hard drive and memory space each job requires, which can help you determine your group size and what machines to use.
- * Prices of different machine sizes fluctuate, so the choice of which type of machines to use in your spot fleet is best determined at the time you run it.
- How long a job takes to run and how quickly you need the data may also affect how much you're willing to bid for any given machine.
- * Running a few large Docker containers (as opposed to many small ones) increases the amount of memory all the copies of your software are sharing, decreasing the likelihood you'll run out of memory if you stagger your job start times.
- However, you're also at a greater risk of running out of hard disk space.
-
- Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking of your configuration.
-
  ## Can I contribute code to Distributed-Something?

  Feel free! We're always looking for ways to improve.
