CoreRJob¶
Within core4, the most straightforward way to run R code is through the class CoreRJob.
As an example, let’s say we want to…
- pass details about a database (here, a collection in MongoDB) from a core4 job to an R script,
- fetch the data and perform operations on it within the R script,
- and finally, return the results to the job
There are 2 ways in which this can be done:
1. Having your entire logic in an R script
pyscript.py:
from core4.queue.helper.job.r import CoreRJob
class MyRJob(CoreRJob):
author = "sjo"
def execute(self):
test_param = "random string"
ret = self.r(source="rscript.R", # the source specifies the path of the R script to use
py_test_param=test_param) # we can also pass variables from python to R in the form of parameters
# the control is now passed over to the R script
self.logger.info("Result 1:\n", ret[0]) # result_df_1 from the R script
self.logger.info("Result 2:\n", ret[1]) # result_df_2 from the R script
if __name__ == '__main__':
from core4.queue.helper.functool import execute
execute(MyRJob)
rscript.R:
library(mongolite)
r_test_param <- {{ py_test_param }} # variables passed from python are thus captured in R
conn <- mongo(collection="{{ config.MyProject.MyCollection.name }}",
db="{{ config.MyProject.MyCollection.database }}",
url="{{ config.mongo_url }}")
# the R script can also directly access core4 configuration variables written in the form of jinja variables
df <- conn$find()
# core logic of the R script here
# say we store the results of the operations into 2 dataframes, 'result_df_1' and 'result_df_2' respectively
return(result_df_1)
return(result_df_2) # we can thus return multiple dataframes / values to python
2. Explicitly declaring an R session and calling specific functions within that session
pyscript.py:
from core4.queue.helper.job.r import CoreRJob
class MyRJob(CoreRJob):
author = "sjo"
def execute(self):
rsession = self.get_rsession() # session with required libraries
self.MyPythonFunction(rsession)
def MyPythonFunction(self, rsession):
collection_name = self.config.MyProject.MyCollection.name
database_name = self.config.MyProject.MyCollection.database
mongo_url = self.config.mongo_url
rsession.source("rscript.R") # specify the path of the R script to use
ret = rsession.MyRFunction(collection_name, database_name, mongo_url) # MyRFunction is in the R script
# core4 configuration variables cannot directly be accessed by functions in an R session
# therefore, they (as well as any other variables you want to pass to the R function) need to be passed as parameters
# the control is now passed over to the R function
self.logger.info("Result:\n", ret)
if __name__ == '__main__':
from core4.queue.helper.functool import execute
execute(MyRJob)
rscript.R:
library(mongolite)
MyRFunction <- function(collection, db, url){
conn <- mongo(collection, db, url)
df <- conn$find()
# core logic of the R function here
# say we store the results of the operations into the dataframe 'result_df'
return(result_df) # the result is returned to the python program
}
Note: In both approaches, the dataframe(s) we return to python cannot be nested. In case it is, a possible workaround is to “flatten” it by using a function such as flatten() from the jsonlite library in R before passing it to the python program.
Which of the two approaches to take can be decided based on the use case.
Approach 1 (entire logic in an R script) is useful if you want to…
- perform all your analyses in R and pass the end result to core4
- pass multiple results to core4
- have access to the core4 configuration in R
Approach 2 (running R functions run through a session) is useful if you want to to…
- implement the program’s logic partly in python and partly in R, i.e. use selective functionality from R
- use functions from python and R in a non-serial order