Disclaimer: I enjoyed reading this article, so I decided to translate it from Chinese to English for a broader audience. This translation is for fair use only; the original copyright is owned by Dr. Tong and MSRA.

The original article is at http://blog.sina.com.cn/s/blog_4caedc7a0102wn0l.html 

Dr. Tong Xin is currently the Principal Researcher of the Internet Graphics Group at Microsoft Research Asia. He graduated from Zhejiang University in 1993 with a bachelor’s degree in computer science, received a master’s degree in computer science from Zhejiang University in 1996, and a Ph.D. in computer science from Tsinghua University in 1999. He joined Microsoft Research Asia that same year. At present, his research focuses on computer graphics.


Reporter: As a graphics researcher, when did you start to investigate VR (virtual reality) and MR (mixed reality)? The concept of VR originated in the 1960s and 1970s but did not enter the mainstream. What do you think was the technical bottleneck?

Tong Xin: The concepts of VR and AR (augmented reality) have existed since the very beginning of computer graphics research. In the early 1990s, the high-end VR system in graphics, the CAVE system developed by EVL, consisted of several display-screen walls that provided an immersive environment, with a graphics workstation behind each screen. With stereoscopic glasses, tracking devices, and data gloves, users could move freely inside, and the content changed with the user’s interaction. This was a very high-end, very expensive system. At that time, VR was mostly for industrial applications (such as the US space program) or military applications. Once PCs appeared, we began to replace the original supercomputers with multiple PCs: by tiling the displays, a CAVE system could be built from PCs. This is one thread in the development of VR. Another thread is the VR system with a head-mounted display for a single user.

AR also started very early, but mainly in exclusive industrial applications. The first applications were military. For example, when a military factory needs to repair a missile, an AR system can show where to open it and what operations need to be done, which greatly improves efficiency and reduces errors. For the military industry, as long as there is a gain in efficiency, the cost does not matter. Another application is the manufacturing and maintenance of aircraft. Boeing adopted AR systems long ago. They place a small translucent display in front of the user’s eyes, superimposing information on the real scene, for very difficult and complicated tasks such as aircraft assembly and repair. Such tasks require people to consult the maintenance guide, so an AR system is helpful even if it only shows a checklist from the manual.

As for the recent wave of VR / AR, one reason is the development of the mobile phone industry over the past few years, which has miniaturized all the sensors and displays and brought prices down to a level everyone can accept. Meanwhile, the computing power of personal computers and GPUs has grown fast enough to render the content. With all of these conditions in place, we can finally provide a better VR experience to ordinary users.

VR has always been there, but it used to be very expensive and exclusive to a small minority; it has now slowly, finally developed to the point where it can reach the average user.

Reporter: In this wave of VR / AR, what are the breakthroughs in computer graphics, human-computer interaction, and sensing technologies?

Tong Xin: There are a few very important advances. First of all, sensors have long been very precise, but their prices were very high. With the popularity of smartphones and the mass production of sensors, prices have become very low, and this progress has improved many positioning technologies. GPUs have also developed very fast and can now produce very realistic displays at high resolution. Behind all this, of course, is a great deal of real-time algorithmic support that combines real-time sensor data for positioning with convincingly rendered content. On the other hand is the progress in interaction technology. In VR today, interaction basically relies on devices: sensors determine the user’s position and head orientation in the virtual environment, and input comes from game controllers, voice, or gestures, providing a good, natural interaction experience. This is also a great breakthrough.

Take Microsoft’s HoloLens as an example. Microsoft has long invested in research and development on natural interaction and on the key input technologies for VR and AR. For realistic real-time rendering, Microsoft developed many algorithms and delivered them to users through Direct3D, which in turn pushed GPU development forward, giving users a more realistic content experience. On the natural-interaction side there is Kinect, which for the first time brought natural body-based interaction to users: an inexpensive depth camera combined with the latest algorithms achieves real-time recognition and tracking of user gestures. The recently introduced holographic glasses, HoloLens, integrate all of the latest interaction and display technology and hardware into one device: the waveguide display, the holographic processor (HPU, Holographic Processing Unit), a wearable and compact computing platform, plus real-time localization and scene-reconstruction software, speech recognition, and gesture recognition. All of these combined make mixed reality practical and bring it to life, giving users a brand-new experience.

Reporter: With the advances of HoloLens, what technology breakthroughs have been achieved in recent years?

Tong Xin: I think there are several significant advances. First, at the hardware level, packing a head-mounted display into something so small and light. HoloLens did it: it is equipped with a see-through, semi-transparent screen, so you can see the outside world while content is displayed on top of it, and the resolution is high enough. On the other hand, HoloLens is a head-mounted computer: all the computing units, including the battery, are integrated into the headset. None of this should be taken for granted; many aspects must be balanced. The integrated hardware must work at high quality while the battery lasts long enough, say 3 to 4 hours. All of this depends on progress in hardware and in manufacturing. But hardware alone is not enough; you also need the most fundamental software to support it. For mixed reality, the core technology is called SLAM (Simultaneous Localization and Mapping): real-time localization and scene modeling. What does this mean? In VR, because the entire field of view is immersed in the virtual environment, I only need to compute the user’s position and render the whole virtual scene accordingly. But now I am in the real world: virtual things can be recomputed as my viewpoint moves, but real objects will not move with it. So I need to know exactly where your head is in the real world, so that the virtual and the real can be mixed convincingly.

For example, suppose I want to display a virtual cup at the corner of the table. I turn around and look back. The real-world table and cup are still there, but if my calculations are inaccurate, the position of the virtual cup appears to have moved. How does the system know you are looking at the original location and that it should display the cup in exactly the same place? The computer must know where I am in the real world and what I am looking at; this must be computed in real time and must be very stable, with no drift, or the user will feel that the displayed content is floating in mid-air. This is a very difficult challenge. Microsoft’s HoloLens performs all of these computations through its cameras and very advanced algorithms, including the dedicated HPU, and provides all of this positioning information in real time. These are the most critical techniques for augmented reality, and especially for mixed reality (MR).
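The virtual-cup example above can be made concrete with a toy calculation. The sketch below (a simplified pinhole-camera model with only a yaw rotation; the function names, the 500-pixel focal length, and the specific error values are my own illustrative assumptions, not anything from HoloLens) shows why even a tiny error in the estimated head pose visibly shifts a world-anchored virtual object on screen:

```python
import math

def project(point_world, head_pos, head_yaw, focal=500.0):
    """Project a world-space 3D point into a simple pinhole camera whose
    pose is a position plus a yaw angle (a deliberately minimal model)."""
    # Move the point into the head (camera) frame: translate, then rotate
    # by the inverse of the head's yaw.
    dx = point_world[0] - head_pos[0]
    dy = point_world[1] - head_pos[1]
    dz = point_world[2] - head_pos[2]
    c, s = math.cos(-head_yaw), math.sin(-head_yaw)
    x_cam = c * dx + s * dz
    z_cam = -s * dx + c * dz
    y_cam = dy
    # Pinhole projection: pixel offsets from the image center.
    return (focal * x_cam / z_cam, focal * y_cam / z_cam)

cup = (0.0, 0.0, 2.0)                 # virtual cup anchored 2 m ahead
true_pose = ((0.0, 0.0, 0.0), 0.0)
est_pose = ((0.02, 0.0, 0.0), math.radians(1.0))  # 2 cm + 1 degree error

u_true = project(cup, *true_pose)
u_est = project(cup, *est_pose)
drift = u_est[0] - u_true[0]
print(drift)  # a seemingly tiny pose error shifts the cup by many pixels
```

With a perfect pose the cup projects to the image center; with only 2 cm of position error and 1 degree of yaw error, the cup drifts by more than ten pixels, which is exactly the "content floating in the air" effect the SLAM system must prevent.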

At the same time, we also recognize that this new mixed-reality experience needs a whole series of supporting technologies behind it. It requires specialized algorithms with a high barrier to entry, whether in content generation, natural interaction, or higher-level intelligent understanding. If only a few large companies generate content and do the development, they still may not meet everyone’s needs. The best way is to build an ecosystem: we not only provide benchmark hardware such as HoloLens, but also a software platform like Holographic. By turning different algorithms and services into APIs that ordinary developers can use, those who want to build applications can use our tools and services and turn their ideas into HoloLens apps. The same program can then run on other VR and MR devices.

Reporter: Virtual reality and mixed reality have opened an era of immersive 3D graphical displays. What issues still need to be addressed in 3D graphics?

Tong Xin: Computing shading with lighting and shadows in computer graphics is called “rendering”. Rendering lighting and shadows in real time has always been a hot topic in the research community. At Microsoft Research, we were the first to use machine learning approaches for this problem, and for the first time achieved some very complex lighting effects in real time. We believe that as these technologies develop, more impressive lighting effects will be presented to everyone in VR and MR.
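For readers unfamiliar with what "rendering lighting" means computationally, the simplest possible instance is diffuse (Lambertian) shading; the learned models mentioned above approximate far more complex effects, but this minimal sketch (names and constants are my own, not from any Microsoft system) shows the basic cosine law that every shading model builds on:

```python
import math

def lambert(normal, light_dir, albedo=0.8, light_intensity=1.0):
    """Diffuse (Lambertian) shading: reflected brightness is proportional
    to the cosine of the angle between the surface normal and the light
    direction, clamped to zero for surfaces facing away from the light."""
    dot = sum(n * l for n, l in zip(normal, light_dir))
    return albedo * light_intensity * max(dot, 0.0)

up = (0.0, 1.0, 0.0)  # surface normal pointing straight up

# Light directly overhead: full brightness (albedo * 1.0).
print(lambert(up, (0.0, 1.0, 0.0)))
# Light 60 degrees off the normal: cos(60 deg) = 0.5, so half as bright.
tilted = (math.sin(math.radians(60)), math.cos(math.radians(60)), 0.0)
print(lambert(up, tilted))
```

Effects like soft shadows, interreflection, and glossy materials require integrating many such light contributions per pixel, which is why doing them in real time is hard and why precomputed or learned approximations are attractive.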

Another problem is how to generate realistic 3D scenes and interactive content more conveniently. Traditionally we need artists to model everything, but another way is to capture it directly from the real world. For example, suppose I want to make a coffee shop. One way is for an artist to build it manually in 3D modeling software, including all the details. Another way is to take a depth camera, or even an ordinary camera, and capture the geometry and materials of all the tables and walls of a real café, then place them in the 3D scene; the realism improves immediately, and all the table-top materials look very real. With this technology, the artist does not start from nothing: he can modify the materials in the captured scene, for example adding rust to a table to give it more texture. Content-capture technology is therefore a very important technical path. Microsoft Research has done a lot of work in this direction. Our goal is to let ordinary users enjoy this technology, using a Kinect depth camera, or even a phone or ordinary camera, to capture the 3D geometry of objects the user is interested in, together with their rich surface materials and lighting effects, and reproduce them faithfully in the virtual world. Once this problem is solved, ordinary users will be able to produce high-quality 3D content. Then the virtual and mixed-reality worlds will become rich and colorful, and the user experience will improve by an order of magnitude.
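The first step of the depth-camera capture pipeline described above is turning a depth image into 3D geometry. A minimal sketch of that step, assuming a standard pinhole camera model (the intrinsic values below are illustrative placeholders roughly in the range of a consumer depth camera, not official Kinect parameters):

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a camera-space 3D
    point via the pinhole model: x = (u - cx) * depth / fx, and so on."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical intrinsics: focal lengths and principal point in pixels.
fx = fy = 525.0
cx, cy = 319.5, 239.5

# Turn a tiny 2x2 depth map (in metres) into a point cloud.
depth_map = [[1.0, 1.0],
             [1.2, 1.2]]
cloud = [backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
         for v in range(2) for u in range(2)]
print(cloud[0])
```

A real capture system fuses many such per-frame point clouds into one surface (as KinectFusion-style methods do) and additionally estimates materials and lighting, but every variant starts from this back-projection.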

Reporter: What other issues need to be solved for mixed reality to become practical?

Tong Xin: First, from the interaction point of view, there must be positioning, plus voice, gestures, expressions, and other natural modes of interaction. These technologies still need to mature further. If high-quality output does not match the user’s input, the user will find the technology hard to use and unnatural. One problem that is often overlooked is intelligent perception. To make the mixed-reality experience better, we need better next-generation artificial intelligence and recognition technology.

For example, suppose I put on AR glasses to operate something. Positioning technology tells the computer I am staring at an object, but what is it? Recognition technology may be needed to “know” that it is a remote control. Then the system knows the user wants to use the remote control; its information can be retrieved from a database and presented to the user as visual guidance, such as which button to press, and based on the user’s gestures and any problems that arise, further guidance is given. You can see that even in this simple example, natural interaction, display, and recognition all have to work together for the scenario to succeed. If anything is missing, the result sounds very good and feels fresh at first, but the user soon finds the operation more troublesome than the original device and method, which naturally creates a huge gap between user expectations and actual results. So Microsoft hopes to do research at all of these levels and provide solutions that narrow the gap, so that mixed reality becomes genuinely useful to users.

In the long term, AR’s application scope is far broader than VR’s, and it will permeate all aspects of life in the future. When you put on a VR headset, you cannot see the real environment; you experience the virtual world entirely. Mixed reality is better imagined as a visual aid, something that greatly enhances and facilitates your life in the real world. However, the technical bar for AR is higher, so we feel the popularization of AR may come much later than VR.

Some people say we have to wait 10 years for AR to mature; my personal prediction is more optimistic, for two reasons:

  • First, many of the fundamental intelligent-sensing technologies that AR relies on are developing faster than we thought.
  • Second, with the maturing of AI technologies, the perception layer of AR is advancing quickly. For example, object recognition has progressed tremendously in recent years. The progress of these technologies can greatly accelerate AR. If they mature faster than expected, AR’s application scenarios will arrive sooner, but the specific timing is hard for me to predict, because technology is developing so quickly.