mirror of
				https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
				synced 2025-10-31 02:59:34 +00:00 
			
		
		
		
	 670e9f34ee
			
		
	
	
		670e9f34ee
		
	
	
	
	
		
			
			Remove many duplicated words under Documentation/ and do other small
cleanups.
Examples:
        "and and" --> "and"
        "in in" --> "in"
        "the the" --> "the"
        "the the" --> "to the"
        ...
Signed-off-by: Paolo Ornati <ornati@fastwebnet.it>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
		
	
			
		
			
				
	
	
		
			296 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			296 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| 
 | |
| unshare system call:
 | |
| --------------------
 | |
| This document describes the new system call, unshare. The document
 | |
| provides an overview of the feature, why it is needed, how it can
 | |
| be used, its interface specification, design, implementation and
 | |
| how it can be tested.
 | |
| 
 | |
| Change Log:
 | |
| -----------
 | |
| version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006
 | |
| 
 | |
| Contents:
 | |
| ---------
 | |
| 	1) Overview
 | |
| 	2) Benefits
 | |
| 	3) Cost
 | |
| 	4) Requirements
 | |
| 	5) Functional Specification
 | |
| 	6) High Level Design
 | |
| 	7) Low Level Design
 | |
| 	8) Test Specification
 | |
| 	9) Future Work
 | |
| 
 | |
| 1) Overview
 | |
| -----------
 | |
| Most legacy operating system kernels support an abstraction of threads
 | |
| as multiple execution contexts within a process. These kernels provide
 | |
| special resources and mechanisms to maintain these "threads". The Linux
 | |
| kernel, in a clever and simple manner, does not make distinction
 | |
| between processes and "threads". The kernel allows processes to share
 | |
| resources and thus they can achieve legacy "threads" behavior without
 | |
| requiring additional data structures and mechanisms in the kernel. The
 | |
| power of implementing threads in this manner comes not only from
 | |
| its simplicity but also from allowing application programmers to work
 | |
| outside the confinement of all-or-nothing shared resources of legacy
 | |
| threads. On Linux, at the time of thread creation using the clone system
 | |
| call, applications can selectively choose which resources to share
 | |
| between threads.
 | |
| 
 | |
| unshare system call adds a primitive to the Linux thread model that
 | |
| allows threads to selectively 'unshare' any resources that were being
 | |
| shared at the time of their creation. unshare was conceptualized by
 | |
| Al Viro in the August of 2000, on the Linux-Kernel mailing list, as part
 | |
| of the discussion on POSIX threads on Linux.  unshare augments the
 | |
| usefulness of Linux threads for applications that would like to control
 | |
| shared resources without creating a new process. unshare is a natural
 | |
| addition to the set of available primitives on Linux that implement
 | |
| the concept of process/thread as a virtual machine.
 | |
| 
 | |
| 2) Benefits
 | |
| -----------
 | |
| unshare would be useful to large application frameworks such as PAM
 | |
| where creating a new process to control sharing/unsharing of process
 | |
| resources is not possible. Since namespaces are shared by default
 | |
| when creating a new process using fork or clone, unshare can benefit
 | |
| even non-threaded applications if they have a need to disassociate
 | |
| from default shared namespace. The following lists two use-cases
 | |
| where unshare can be used.
 | |
| 
 | |
| 2.1 Per-security context namespaces
 | |
| -----------------------------------
 | |
| unshare can be used to implement polyinstantiated directories using
 | |
| the kernel's per-process namespace mechanism. Polyinstantiated directories,
 | |
| such as per-user and/or per-security context instance of /tmp, /var/tmp or
 | |
| per-security context instance of a user's home directory, isolate user
 | |
| processes when working with these directories. Using unshare, a PAM
 | |
| module can easily setup a private namespace for a user at login.
 | |
| Polyinstantiated directories are required for Common Criteria certification
 | |
| with Labeled System Protection Profile, however, with the availability
 | |
| of shared-tree feature in the Linux kernel, even regular Linux systems
 | |
| can benefit from setting up private namespaces at login and
 | |
| polyinstantiating /tmp, /var/tmp and other directories deemed
 | |
| appropriate by system administrators.
 | |
| 
 | |
| 2.2 unsharing of virtual memory and/or open files
 | |
| -------------------------------------------------
 | |
| Consider a client/server application where the server is processing
 | |
| client requests by creating processes that share resources such as
 | |
| virtual memory and open files. Without unshare, the server has to
 | |
| decide what needs to be shared at the time of creating the process
 | |
| which services the request. unshare allows the server an ability to
 | |
| disassociate parts of the context during the servicing of the
 | |
| request. For large and complex middleware application frameworks, this
 | |
| ability to unshare after the process was created can be very
 | |
| useful.
 | |
| 
 | |
| 3) Cost
 | |
| -------
 | |
| In order to not duplicate code and to handle the fact that unshare
 | |
| works on an active task (as opposed to clone/fork working on a newly
 | |
| allocated inactive task) unshare had to make minor reorganizational
 | |
| changes to copy_* functions utilized by clone/fork system call.
 | |
| There is a cost associated with altering existing, well tested and
 | |
| stable code to implement a new feature that may not get exercised
 | |
| extensively in the beginning. However, with proper design and code
 | |
| review of the changes and creation of an unshare test for the LTP
 | |
| the benefits of this new feature can exceed its cost.
 | |
| 
 | |
| 4) Requirements
 | |
| ---------------
 | |
| unshare reverses sharing that was done using clone(2) system call,
 | |
| so unshare should have a similar interface as clone(2). That is,
 | |
| since flags in clone(int flags, void *stack) specifies what should
 | |
| be shared, similar flags in unshare(int flags) should specify
 | |
| what should be unshared. Unfortunately, this may appear to invert
 | |
| the meaning of the flags from the way they are used in clone(2).
 | |
| However, there was no easy solution that was less confusing and that
 | |
| allowed incremental context unsharing in future without an ABI change.
 | |
| 
 | |
| unshare interface should accommodate possible future addition of
 | |
| new context flags without requiring a rebuild of old applications.
 | |
| If and when new context flags are added, unshare design should allow
 | |
| incremental unsharing of those resources on an as needed basis.
 | |
| 
 | |
| 5) Functional Specification
 | |
| ---------------------------
 | |
| NAME
 | |
| 	unshare - disassociate parts of the process execution context
 | |
| 
 | |
| SYNOPSIS
 | |
| 	#include <sched.h>
 | |
| 
 | |
| 	int unshare(int flags);
 | |
| 
 | |
| DESCRIPTION
 | |
| 	unshare allows a process to disassociate parts of its execution
 | |
| 	context that are currently being shared with other processes. Part
 | |
| 	of execution context, such as the namespace, is shared by default
 | |
| 	when a new process is created using fork(2), while other parts,
 | |
| 	such as the virtual memory, open file descriptors, etc, may be
 | |
| 	shared by explicit request to share them when creating a process
 | |
| 	using clone(2).
 | |
| 
 | |
| 	The main use of unshare is to allow a process to control its
 | |
| 	shared execution context without creating a new process.
 | |
| 
 | |
| 	The flags argument specifies one or bitwise-or'ed of several of
 | |
| 	the following constants.
 | |
| 
 | |
| 	CLONE_FS
 | |
| 		If CLONE_FS is set, file system information of the caller
 | |
| 		is disassociated from the shared file system information.
 | |
| 
 | |
| 	CLONE_FILES
 | |
| 		If CLONE_FILES is set, the file descriptor table of the
 | |
| 		caller is disassociated from the shared file descriptor
 | |
| 		table.
 | |
| 
 | |
| 	CLONE_NEWNS
 | |
| 		If CLONE_NEWNS is set, the namespace of the caller is
 | |
| 		disassociated from the shared namespace.
 | |
| 
 | |
| 	CLONE_VM
 | |
| 		If CLONE_VM is set, the virtual memory of the caller is
 | |
| 		disassociated from the shared virtual memory.
 | |
| 
 | |
| RETURN VALUE
 | |
| 	On success, zero returned. On failure, -1 is returned and errno is
 | |
| 
 | |
| ERRORS
 | |
| 	EPERM	CLONE_NEWNS was specified by a non-root process (process
 | |
| 		without CAP_SYS_ADMIN).
 | |
| 
 | |
| 	ENOMEM	Cannot allocate sufficient memory to copy parts of caller's
 | |
| 		context that need to be unshared.
 | |
| 
 | |
| 	EINVAL	Invalid flag was specified as an argument.
 | |
| 
 | |
| CONFORMING TO
 | |
| 	The unshare() call is Linux-specific and  should  not be used
 | |
| 	in programs intended to be portable.
 | |
| 
 | |
| SEE ALSO
 | |
| 	clone(2), fork(2)
 | |
| 
 | |
| 6) High Level Design
 | |
| --------------------
 | |
| Depending on the flags argument, the unshare system call allocates
 | |
| appropriate process context structures, populates it with values from
 | |
| the current shared version, associates newly duplicated structures
 | |
| with the current task structure and releases corresponding shared
 | |
| versions. Helper functions of clone (copy_*) could not be used
 | |
| directly by unshare because of the following two reasons.
 | |
|   1) clone operates on a newly allocated not-yet-active task
 | |
|      structure, where as unshare operates on the current active
 | |
|      task. Therefore unshare has to take appropriate task_lock()
 | |
|      before associating newly duplicated context structures
 | |
|   2) unshare has to allocate and duplicate all context structures
 | |
|      that are being unshared, before associating them with the
 | |
|      current task and releasing older shared structures. Failure
 | |
|      do so will create race conditions and/or oops when trying
 | |
|      to backout due to an error. Consider the case of unsharing
 | |
|      both virtual memory and namespace. After successfully unsharing
 | |
|      vm, if the system call encounters an error while allocating
 | |
|      new namespace structure, the error return code will have to
 | |
|      reverse the unsharing of vm. As part of the reversal the
 | |
|      system call will have to go back to older, shared, vm
 | |
|      structure, which may not exist anymore.
 | |
| 
 | |
| Therefore code from copy_* functions that allocated and duplicated
 | |
| current context structure was moved into new dup_* functions. Now,
 | |
| copy_* functions call dup_* functions to allocate and duplicate
 | |
| appropriate context structures and then associate them with the
 | |
| task structure that is being constructed. unshare system call on
 | |
| the other hand performs the following:
 | |
|   1) Check flags to force missing, but implied, flags
 | |
|   2) For each context structure, call the corresponding unshare
 | |
|      helper function to allocate and duplicate a new context
 | |
|      structure, if the appropriate bit is set in the flags argument.
 | |
|   3) If there is no error in allocation and duplication and there
 | |
|      are new context structures then lock the current task structure,
 | |
|      associate new context structures with the current task structure,
 | |
|      and release the lock on the current task structure.
 | |
|   4) Appropriately release older, shared, context structures.
 | |
| 
 | |
| 7) Low Level Design
 | |
| -------------------
 | |
| Implementation of unshare can be grouped in the following 4 different
 | |
| items:
 | |
|   a) Reorganization of existing copy_* functions
 | |
|   b) unshare system call service function
 | |
|   c) unshare helper functions for each different process context
 | |
|   d) Registration of system call number for different architectures
 | |
| 
 | |
|   7.1) Reorganization of copy_* functions
 | |
|        Each copy function such as copy_mm, copy_namespace, copy_files,
 | |
|        etc, had roughly two components. The first component allocated
 | |
|        and duplicated the appropriate structure and the second component
 | |
|        linked it to the task structure passed in as an argument to the copy
 | |
|        function. The first component was split into its own function.
 | |
|        These dup_* functions allocated and duplicated the appropriate
 | |
|        context structure. The reorganized copy_* functions invoked
 | |
|        their corresponding dup_* functions and then linked the newly
 | |
|        duplicated structures to the task structure with which the
 | |
|        copy function was called.
 | |
| 
 | |
|   7.2) unshare system call service function
 | |
|        * Check flags
 | |
| 	 Force implied flags. If CLONE_THREAD is set force CLONE_VM.
 | |
| 	 If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
 | |
| 	 set and signals are also being shared, force CLONE_THREAD. If
 | |
| 	 CLONE_NEWNS is set, force CLONE_FS.
 | |
|        * For each context flag, invoke the corresponding unshare_*
 | |
| 	 helper routine with flags passed into the system call and a
 | |
| 	 reference to pointer pointing the new unshared structure
 | |
|        * If any new structures are created by unshare_* helper
 | |
| 	 functions, take the task_lock() on the current task,
 | |
| 	 modify appropriate context pointers, and release the
 | |
|          task lock.
 | |
|        * For all newly unshared structures, release the corresponding
 | |
|          older, shared, structures.
 | |
| 
 | |
|   7.3) unshare_* helper functions
 | |
|        For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
 | |
|        and CLONE_THREAD, return -EINVAL since they are not implemented yet.
 | |
|        For others, check the flag value to see if the unsharing is
 | |
|        required for that structure. If it is, invoke the corresponding
 | |
|        dup_* function to allocate and duplicate the structure and return
 | |
|        a pointer to it.
 | |
| 
 | |
|   7.4) Appropriately modify architecture specific code to register the
 | |
|        new system call.
 | |
| 
 | |
| 8) Test Specification
 | |
| ---------------------
 | |
| The test for unshare should test the following:
 | |
|   1) Valid flags: Test to check that clone flags for signal and
 | |
| 	signal handlers, for which unsharing is not implemented
 | |
| 	yet, return -EINVAL.
 | |
|   2) Missing/implied flags: Test to make sure that if unsharing
 | |
| 	namespace without specifying unsharing of filesystem, correctly
 | |
| 	unshares both namespace and filesystem information.
 | |
|   3) For each of the four (namespace, filesystem, files and vm)
 | |
| 	supported unsharing, verify that the system call correctly
 | |
| 	unshares the appropriate structure. Verify that unsharing
 | |
| 	them individually as well as in combination with each
 | |
| 	other works as expected.
 | |
|   4) Concurrent execution: Use shared memory segments and futex on
 | |
| 	an address in the shm segment to synchronize execution of
 | |
| 	about 10 threads. Have a couple of threads execute execve,
 | |
| 	a couple _exit and the rest unshare with different combination
 | |
| 	of flags. Verify that unsharing is performed as expected and
 | |
| 	that there are no oops or hangs.
 | |
| 
 | |
| 9) Future Work
 | |
| --------------
 | |
| The current implementation of unshare does not allow unsharing of
 | |
| signals and signal handlers. Signals are complex to begin with and
 | |
| to unshare signals and/or signal handlers of a currently running
 | |
| process is even more complex. If in the future there is a specific
 | |
| need to allow unsharing of signals and/or signal handlers, it can
 | |
| be incrementally added to unshare without affecting legacy
 | |
| applications using unshare.
 | |
| 
 |