I will give you a visual example to explain the basics of X11 and what is going on in the background:
In this example you have a local X11-server with two "screens" on your hostA. Usually there would be only one server with one screen (:0.0), which spans across all your monitors (makes multi-monitor applications way easier). hostB has two X servers, where the second one has no physical display (e.g. virtual framebuffer for VNC). hostC is a headless server without any monitors.
terminal 1a, 2a, 5a, 6a:
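If you open a local terminal and set the display to :0.0 (the default) or :0.1, the drawing calls of your graphical programs are sent directly to the local X server over a local connection (typically a Unix domain socket, optionally with shared memory) rather than over the network.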
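As a rough illustration of the local case: on Linux and most Unix-like systems, the local connection is typically a Unix domain socket named after the display number. The helper name local_socket_path is hypothetical, and the socket directory is the common convention, not something taken from the diagram.

```python
from pathlib import PurePosixPath

# Assumption: the conventional X11 socket directory on Linux and most Unix-likes.
X11_SOCKET_DIR = PurePosixPath("/tmp/.X11-unix")


def local_socket_path(display_number: int) -> str:
    """Return the conventional Unix socket path for a local display number."""
    if display_number < 0:
        raise ValueError("display number must be non-negative")
    # Display :0 maps to .../X0, display :1 to .../X1, and so on.
    return str(X11_SOCKET_DIR / f"X{display_number}")


print(local_socket_path(0))
```

Note that the screen number (the part after the dot) does not appear in the path: :0.0 and :0.1 both reach the same server through the same socket.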
terminal 1b, 5b:
If you ssh onto some server, the display will usually be set automatically to the local X server, if one is available. Otherwise it will not be set at all (for the reason, see terminal 3).
terminal 2b, 6b:
If you ssh onto a server and enable X11 forwarding via the "-X" parameter, a tunnel is automatically created through the ssh connection. In this case, TCP port 6010 (6000 + display number) on hostB forwards the traffic to port 6000 (X server #0) on hostA. Usually the first 10 displays are reserved for "real" servers, so ssh starts its forwarded displays at #10 (the next user connecting with ssh -X while you are logged in would get #11). No additional X server is started, and permissions for X server #0 on hostA are handled automatically by ssh.
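The port arithmetic can be sketched in a few lines. The function name x11_tcp_port is hypothetical; the only fact it encodes is the "6000 + display number" rule described above.

```python
X11_BASE_PORT = 6000  # TCP port of display 0


def x11_tcp_port(display_number: int) -> int:
    """Return the TCP port that corresponds to an X11 display number."""
    if display_number < 0:
        raise ValueError("display number must be non-negative")
    return X11_BASE_PORT + display_number


print(x11_tcp_port(0))   # 6000, X server #0 on hostA
print(x11_tcp_port(10))  # 6010, the forwarded display on hostB
```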
terminal 4:
If you put a hostname (e.g. localhost) in front of the display and screen number, X11 will communicate via TCP instead of a local connection.
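To make the host:display.screen format concrete, here is a minimal parsing sketch. It is not how Xlib itself parses the value: the names DisplayName and parse_display are hypothetical, and special cases such as IPv6 addresses or the "unix" host prefix are deliberately ignored.

```python
import re
from typing import NamedTuple


class DisplayName(NamedTuple):
    host: str       # empty string means "no hostname given"
    display: int    # which X server
    screen: int     # which screen of that server (defaults to 0)
    transport: str  # "local" or "tcp", following the rule described above


# Simplified pattern: optional host, a colon, display number, optional ".screen".
_DISPLAY_RE = re.compile(r"^(?P<host>[^:]*):(?P<display>\d+)(?:\.(?P<screen>\d+))?$")


def parse_display(name: str) -> DisplayName:
    """Split a DISPLAY-style string into host, display and screen."""
    match = _DISPLAY_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised display name: {name!r}")
    host = match.group("host")
    screen = int(match.group("screen") or 0)
    # Assumption: any non-empty host is treated as a TCP connection.
    transport = "tcp" if host else "local"
    return DisplayName(host, int(match.group("display")), screen, transport)


print(parse_display(":0.1"))
print(parse_display("localhost:10.0"))
```

With this sketch, ":0.1" is reported as a local connection to display 0, screen 1, while "localhost:10.0" is reported as a TCP connection to display 10, screen 0.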
terminal 3:
You can also send X11 commands directly over the network, without setting up an ssh tunnel first. The main problem is that your network, firewall, etc. must be configured to allow this (beware: X11 is practically unencrypted), and permissions for the X server must be granted manually (xhost or Xauthority).
To answer your questions
What are the relations and differences between X server, display and screen?
A display just refers to some X server somewhere. The term "both displays" was referring to ":0.0" on the local computer ("local display") being equal to "localhost:10.0" on the ssh target ("TCP display"). "Screens" refers to the different virtual monitors (framebuffers) of the X server. "localhost:10.0" only redirects to the local X server; no X server is started on the ssh target.
So does a X server start in a display or a screen?
I’m not sure how to say this in a different way than I did previously; for all intents and purposes, the X server is a display (“display” as the X Window concept, which I understand is what we’re discussing here). An X server doesn’t start in a display, it is a display. You can think of this as “an X server starts a display”, and “a display contains one or more screens”.
The DISPLAY variable can be confusing since, as you say, it can specify more than the X display.
Which one is correct?
The diagram; see the explanation below.
Does a display server start in a display or a screen or a monitor?
In the X Window documentation, “display server” is synonymous with X server, so the above applies.
It may help to consider that the X Window documentation was written a long time ago, at a time when virtual displays weren’t used (much, if at all), and when multi-monitor setups were complex and often involved multiple X screens, and sometimes even multiple X servers. So in the X documentation, a screen is usually a monitor. However it quickly became obvious that it was annoying to split multiple monitors into multiple screens, and once graphics cards became capable of handling multiple monitors as a single unit, usage patterns changed so that X screens tended to cover multiple monitors.
Is a framebuffer associated with a display or a screen or a monitor?
“Framebuffer” is a somewhat nebulous term, with multiple definitions. In the context of the comment you’re quoting, it’s associated with a screen, and you can see this with Xvfb: if you tell it to use memory-mapped files for its framebuffers, and define multiple screens, you’ll see it use one framebuffer file per screen.
Best Answer
In X11 terminology:
Display: at least one screen, a keyboard, and a pointing device (often a mouse).
Screen: What everyone else calls a display, monitor, or screen, but could be virtual, e.g. a region of a monitor (window).
Both displays and screens are addressable via the DISPLAY environment variable, and by some other means. An application can choose which display.screen to map a window to, but a window cannot be moved to another screen unless the application unmaps and remaps it.
Monitor: This is (I think) a newer idea. Each screen can be made up of monitors. Generally applications don't know about monitors; only the window manager does. The window manager can freely move windows between monitors, and windows can even overlap monitor boundaries. All monitors are mapped as a single rectangular screen, but the window manager knows where each monitor starts and ends, so it can full-screen a window on just one monitor or detect monitor-edge gestures. (I think a monitor is probably no more than a set of hints that the window manager uses.) If your window manager is not monitor-aware, windows will full-screen over the whole screen.
Screens are not used much these days, at least not for interactive desktops with a window manager that supports monitors. However, screens would be useful when the application should be in charge rather than the window manager. Even this does not seem to be necessary, though: OpenOffice knows about monitors and uses them when presenting.