Using Hadoop Impersonation | DeployR 8.x
Applies to: DeployR 8.x (See comparison between 8.x and 9.x[1])
Looking for docs for Microsoft R Server 9? Start here[2].
Typically when invoking system commands from R, those commands will run as the user that started R. In the case of Hadoop, if user abc logs into a node on a Hadoop cluster and starts R, any system commands, such as system("hadoop fs -ls /"), will run as user abc. File permissions in HDFS will be honored accordingly, for example user abc will not be able to access files in HDFS if that user does not have proper permissions.
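This identity inheritance can be seen with a plain shell sketch, no Hadoop required; a spawned command runs as the same OS user as the process that launched it, which is exactly why R's system() calls inherit the identity of whoever started R:

```shell
# A child process runs as the same OS user as its parent.
echo "parent user: $(id -un)"
sh -c 'echo "child user: $(id -un)"'
```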
However, when using DeployR, every R session is started as the same user. This is an artifact of the Rserve component, which DeployR uses to create and interact with R sessions. To work around this limitation, we use Hadoop Impersonation[3], the same mechanism employed by standard Hadoop services such as Hue and WebHDFS.
The idea behind impersonation is that R will set an environment variable that tells Hadoop to run commands as a different user than the user who started the R session. You can use one of two Hadoop-supported environment variables:
- HADOOP_USER_NAME for non-Kerberos secured clusters
- HADOOP_PROXY_USER for clusters secured with Kerberos
This document assumes we are working with a Kerberos secured cluster.
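As a quick sanity check outside of R, you can see how such a variable reaches a Hadoop client: the hadoop command-line tools simply read it from the environment of the process that invokes them, so exporting it beforehand is enough. The user name alice below is a placeholder:

```shell
# Hypothetical illustration: an exported variable is visible to any
# command launched afterward, which is how the hadoop client picks
# up HADOOP_PROXY_USER (or HADOOP_USER_NAME on a non-Kerberos cluster).
export HADOOP_PROXY_USER=alice
sh -c 'echo "proxy user seen by the hadoop client: $HADOOP_PROXY_USER"'
```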
The rest of the document will focus on the steps needed to get Hadoop Impersonation to work with DeployR.
Creating the 'rserve' User
Since the DeployR Rserve component runs by default as the apache user, which does not have a home directory, we recommend that you create a new user that:
- Will be used to run Rserve
- Has a home directory
- Starts a bash shell
In this example, we will create the user rserve and change which user is used to run Rserve.
Non-Root Installation
If DeployR was installed as non-root, then you must ensure that the user that starts DeployR has a home directory and starts a bash shell. Nothing else is required.
Root Installation
If DeployR was installed as user root, do the following:
1. Create the Linux user rserve.

2. Go to the directory where DeployR is installed. For example:

   cd /opt/deployr/<version>

3. Go to the rserve sub-directory and open rserve.sh in an editor. For example:

   vi /opt/deployr/<version>/rserve/rserve.sh

4. Replace all instances of apache with rserve. There are two instances of the string apache in the file.

5. Give full write permissions to the workdir directory. For example:

   chmod 777 /opt/deployr/<version>/rserve/workdir

6. Edit /etc/group and add rserve as a member of group apache. For example:

   apache:x:502:rserve

7. Restart Rserve:

   cd /opt/deployr/<version>/rserve
   ./rserve.sh stop
   ./rserve.sh start
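The apache-to-rserve replacement can also be scripted rather than done in an editor. A minimal sketch, run here against a scratch stand-in for rserve.sh (the contents are illustrative; the real file lives under the DeployR install directory):

```shell
# Create a scratch stand-in with the two 'apache' occurrences the
# document mentions (illustrative contents only).
cat > /tmp/rserve.sh <<'EOF'
RSERVE_USER=apache
su - apache -c "$START_CMD"
EOF

# Replace every instance of 'apache' with 'rserve' in place.
sed -i 's/apache/rserve/g' /tmp/rserve.sh

# Both lines should now mention 'rserve' and none 'apache'.
grep -c rserve /tmp/rserve.sh
```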
Setting Up the Environment for the 'rserve' User
ScaleR: If you are using ScaleR inside Hadoop, add the following line to .bash_profile for user rserve. This ensures that all environment variables needed by ScaleR are set properly:

   . /etc/profile.d/Revolution/bash_profile_additions

Kerberos: If your Hadoop cluster is secured using Kerberos, obtain a Kerberos ticket for principal hdfs. This ticket will act as the proxy for all other users. Be sure you are the Linux user rserve when obtaining the ticket. For example:

   su - rserve
   kinit hdfs

We recommend that you use a cron job or equivalent to renew this ticket periodically to keep it from expiring.
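One way to automate the renewal is a crontab entry for user rserve. The sketch below re-obtains the ticket from a keytab; the keytab path is an assumption, and kinit -R is an alternative if your tickets are renewable:

```shell
# Hypothetical crontab entry (installed via 'crontab -e' as user
# rserve): refresh the hdfs ticket every 8 hours from a keytab.
0 */8 * * * kinit -kt /home/rserve/hdfs.keytab hdfs
```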
Testing the Environment
1. Create a private file in HDFS. In the following example, user revo owns the file:

   -rw------- 3 revo revo 14304763 2014-06-26 15:53 /share/AirlineDemoSmall/AirlineDemoSmall.csv

2. Create a sample RScript that tries to access the file. For example:

   #in this case we read the file stored in HDFS into a dataframe
   filename<-"/share/AirlineDemoSmall/AirlineDemoSmall.csv"
   df<-read.table(pipe(paste('hadoop fs -cat ', filename, sep="")), sep=",", header=TRUE)
   summary(df)

3. When you run this script from the R console as user rserve, it works because principal hdfs has a valid Kerberos ticket for the Linux user rserve.

4. Modify the script to use the HADOOP_PROXY_USER environment variable by adding Sys.setenv(HADOOP_PROXY_USER='rserve'):

   #now we will run the script through the proxy
   filename<-"/share/AirlineDemoSmall/AirlineDemoSmall.csv"
   Sys.setenv(HADOOP_PROXY_USER='rserve')
   df<-read.table(pipe(paste('hadoop fs -cat ', filename, sep="")), sep=",", header=TRUE)
   summary(df)

5. Unlike the previous step, the script now fails. The principal hdfs is no longer acting as a superuser for the command; instead, it acts as a proxy for the user rserve, who has no read permission on the file. This is a subtle but important difference.

   > #now we will run the script through the proxy
   >
   > filename<-"/share/AirlineDemoSmall/AirlineDemoSmall.csv"
   > Sys.setenv(HADOOP_PROXY_USER='rserve')
   > df<-read.table(pipe(paste('hadoop fs -cat ', filename, sep="")), sep=",", header=TRUE)
   cat: Permission denied: user=rserve, access=READ, inode="/share/AirlineDemoSmall/AirlineDemoSmall.csv":revo:revo:-rw-------
   Error in read.table(pipe(paste("hadoop fs -cat ", filename, sep = "")), :
     no lines available in input
   > summary(df)
   Error in object[[i]] : object of type 'closure' is not subsettable

6. RScript for DeployR: We don't want to leave the line of code that sets HADOOP_PROXY_USER in our RScript, so we remove it and revert to the original script. In addition, we will pass the filename into the script from our DeployR client application. The script simply becomes:

   df<-read.table(pipe(paste('hadoop fs -cat ', filename, sep="")), sep=",", header=TRUE)
   summary(df)
We can now upload the script to DeployR using the Repository Manager. In our example, user testuser creates a directory called demo in the DeployR Repository Manager and names the RScript piperead.R.
Creating a DeployR Client
Create a DeployR client application to control which user will be running the script.
This example code is written in Visual Basic using the DeployR .NET client library. An equivalent example could be written in C#, Java, or JavaScript.
Imports DeployR
Module Module1
Sub Main()
hdfsRead(False, "revo", "/share/AirlineDemoSmall/AirlineDemoSmall.csv", "piperead.R")
End Sub
Sub hdfsRead(ByVal bDebug As Boolean, _
ByVal sProxyUser As String, _
ByVal sHDFSfile As String, _
ByVal sRscript As String)
Dim client As RClient
Dim user As RUser
Dim project As RProject
Dim execution As RProjectExecution
Dim options As New ProjectExecutionOptions
Dim sEnv As String
Try
'set debug mode if we need it
RClientFactory.setDebugMode(bDebug)
'create a DeployR client
client = RClientFactory.createClient("http://localhost:7300/deployr", 10)
'login
Dim basic As New RBasicAuthentication("testuser", "changeme")
user = client.login(basic, False)
'create a temporary project
project = user.createProject()
'set the Environment Variable
sEnv = "Sys.setenv(HADOOP_PROXY_USER='" + sProxyUser + "')"
project.executeCode(sEnv)
'collect the inputs
options.rinputs.Add(RDataFactory.createString("filename", sHDFSfile))
'execute the script
execution = project.executeScript(sRscript, "demo", "testuser", "", options)
'write the R console output
Console.WriteLine(execution.about.console)
Catch ex As Exception
'print exception
Console.WriteLine("Exception: " + ex.Message)
If (TypeOf ex Is HTTPRestException) Then
Dim hex As HTTPRestException = DirectCast(ex, HTTPRestException)
'print R console text, if available
Console.WriteLine("Console: " + hex.console)
End If
Finally
'close the project, if one was created
If project IsNot Nothing Then project.close()
'logout, if we logged in
If user IsNot Nothing Then client.logout(user)
End Try
End Sub
End Module
Extending the Example
We can now extend this example to include two more RScripts for execution.
Second RScript
This script demonstrates how to copy a file from HDFS into the working directory of the R session. This RScript will be uploaded to the DeployR repository as copy_to_workspace.R, in directory demo for user testuser.
#copy a file from HDFS into the working directory
system(paste('hadoop fs -copyToLocal', filename, getwd(), sep=" "))
df<-read.table(basename(filename), sep=",", header=TRUE)
summary(df)
Third RScript
This script runs a ScaleR algorithm as a MapReduce job in Hadoop. This RScript will be uploaded to the DeployR repository as scaleR_hadoop.R, in directory demo for user testuser.
#run an rxSummary on 'filename'
myNameNode <- "default"
myPort <- 0
myHadoopCluster <- RxHadoopMR(consoleOutput=TRUE, autoCleanup=TRUE, nameNode=myNameNode, port=myPort)
rxSetComputeContext(myHadoopCluster)
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)
colInfo <- list(DayOfWeek = list(type = "factor",
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")))
airDS <- RxTextData(file = filename, missingValueString = "M",
colInfo = colInfo, fileSystem = hdfsFS)
adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airDS)
print(adsSummary)
Updated VB.NET Code That Executes All Three Examples
This is an update of the code shown here[4] that now reads in and executes all three scripts.
Imports DeployR
Module Module1
Sub Main()
hdfsRead(False, "revo", "/share/AirlineDemoSmall/AirlineDemoSmall.csv", "piperead.R")
hdfsRead(False, "revo", "/share/AirlineDemoSmall/AirlineDemoSmall.csv", "copy_to_workspace.R")
hdfsRead(False, "revo", "/share/AirlineDemoSmall/AirlineDemoSmall.csv", "scaleR_hadoop.R")
End Sub
Sub hdfsRead(ByVal bDebug As Boolean, _
ByVal sProxyUser As String, _
ByVal sHDFSfile As String, _
ByVal sRscript As String)
Dim client As RClient
Dim user As RUser
Dim project As RProject
Dim execution As RProjectExecution
Dim options As New ProjectExecutionOptions
Dim sEnv As String
Try
'set debug mode if we need it
RClientFactory.setDebugMode(bDebug)
'create a DeployR client
client = RClientFactory.createClient("http://localhost:7300/deployr", 10)
'login
Dim basic As New RBasicAuthentication("testuser", "changeme")
user = client.login(basic, False)
'create a temporary project
project = user.createProject()
'set the Environment Variable
sEnv = "Sys.setenv(HADOOP_PROXY_USER='" + sProxyUser + "')"
project.executeCode(sEnv)
'collect the inputs
options.rinputs.Add(RDataFactory.createString("filename", sHDFSfile))
'execute the script
execution = project.executeScript(sRscript, "demo", "testuser", "", options)
'write the R console output
Console.WriteLine(execution.about.console)
Catch ex As Exception
'print exception
Console.WriteLine("Exception: " + ex.Message)
If (TypeOf ex Is HTTPRestException) Then
Dim hex As HTTPRestException = DirectCast(ex, HTTPRestException)
'print R console text, if available
Console.WriteLine("Console: " + hex.console)
End If
Finally
'close the project, if one was created
If project IsNot Nothing Then project.close()
'logout, if we logged in
If user IsNot Nothing Then client.logout(user)
End Try
End Sub
End Module
References
- ^ comparison between 8.x and 9.x (msdn.microsoft.com)
- ^ Start here (msdn.microsoft.com)
- ^ Hadoop Impersonation (issues.apache.org)
- ^ code shown here (msdn.microsoft.com)