Data Analysis
The heavy lifting is done by (a series of) data processing scripts, housed within the Scripts directory. These should include any major carpentry (reshaping, organizing, filtering, etc.), along with any intensive statistical analyses. The output of both should be saved to the Outputs directory.

Throughout this process, you should treat the raw data as ‘read only’ and the processed data as disposable. To emphasize the important part of this: THE RAW DATA SHOULD NEVER BE OVERWRITTEN! Data is usually labor- and cost-intensive to collect, and you want to minimize the chance of corrupting it, so modify the raw data as little as possible (ideally, not at all). Further, in order for your analyses to be replicable, it’s crucial that you retain the raw data so others can repeat the steps you took (or try their own analyses).

Storing the processed data brings an additional benefit: it reduces the run time and computing power needed for data visualization and manuscript prep. As you’ll see, both of those phases are highly iterative (you’ll usually knit the same document many times before it’s ready to be sent out), and the last thing you want is for that process to be bogged down by computationally intensive analyses.
To balance the need to retain the raw data against the need to reduce run times, data processing scripts read in the raw data and save out processed data, which can then be used downstream for visualizations and other purposes.
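For example, a bare-bones processing script might look something like the sketch below. It’s written in R (in keeping with the knitting workflow mentioned above), and the raw data path, column names, and output file name are all hypothetical placeholders:

```r
## Scripts/01_process_data.R
## Minimal sketch of a processing script. The paths and variable names
## ("Data/raw_survey.csv", weight, site) are hypothetical.

# Read the raw data -- this file is treated as read only and never written back to
raw <- read.csv("Data/raw_survey.csv", stringsAsFactors = FALSE)

# Carpentry: drop incomplete rows, then summarize by site
clean   <- raw[complete.cases(raw), ]
by_site <- aggregate(weight ~ site, data = clean, FUN = mean)

# Save the processed (disposable) data to Outputs for downstream use
saveRDS(by_site, "Outputs/by_site.rds")
```

Downstream scripts (figures, manuscript) can then call `readRDS("Outputs/by_site.rds")` instead of re-running the processing step each time.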
Splitting Scripts
The same principle can be applied to your data processing scripts: if several different analyses go into processing your data, and one of them is particularly long and computationally intensive, you can split the processing steps into separate scripts. By modifying the controller script, you can then pick and choose which processing steps run on any given pass. The approach you use should be tailored to the needs of the project - not all projects require, or would benefit from, the added complexity of splitting scripts.
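A minimal controller for that setup might look something like the sketch below; the script names and flag variables are hypothetical, and the point is simply that each `source()` call can be toggled on or off as needed:

```r
## Scripts/00_controller.R
## Sketch of a controller script; script names and flags are hypothetical.

# Toggle which processing steps run on this pass
run_cleaning <- TRUE
run_models   <- FALSE   # skip the slow model fitting unless it has changed

if (run_cleaning) source("Scripts/01_clean_data.R")
if (run_models)   source("Scripts/02_fit_models.R")

# Downstream steps read the saved outputs either way
source("Scripts/03_make_figures.R")
```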