The Curious Case of the Vanishing Virtual Threads: A Deep Dive into a Java 21 Deadlock at Netflix

Java has been a cornerstone of Netflix's microservices architecture for years. With each new Java release, the Netflix JVM Ecosystem team eagerly explores features that can enhance performance and developer productivity. The advent of virtual threads in Java 21 promised significant improvements in handling high-throughput concurrent applications, and we were excited to integrate them into our ecosystem. However, the journey wasn't without its bumps. This article delves into a peculiar deadlock scenario we encountered during our initial deployment of virtual threads, offering insights into the complexities of this new feature and our investigative process.

Introduction: Embracing Virtual Threads

Virtual threads, introduced in Java 21, are a game-changer for concurrent programming. These lightweight threads significantly reduce the overhead associated with traditional threads, making it easier to write, maintain, and monitor high-throughput applications. Their magic lies in their ability to seamlessly suspend and resume using continuations when encountering blocking operations. This frees up underlying operating system (OS) threads, allowing them to handle other tasks, thus boosting overall performance.
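
To make this concrete, here is a minimal, self-contained sketch (illustrative only, not Netflix code) of the virtual-thread-per-task model: thousands of tasks each block in a sleep, yet only a handful of OS threads are needed because each blocked virtual thread unmounts from its carrier.

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class VirtualThreadsDemo {
    public static void main(String[] args) {
        // Each submitted task runs on its own virtual thread. Blocking in sleep()
        // unmounts the virtual thread from its carrier, so a few OS threads can
        // serve tens of thousands of concurrent tasks.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 10_000).forEach(i ->
                executor.submit(() -> {
                    Thread.sleep(Duration.ofMillis(100)); // blocking call: carrier is released
                    return i;
                }));
        } // close() waits for all submitted tasks to complete
    }
}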

At Netflix, we were keen to harness the potential of virtual threads. After a successful migration to Java 21 and generational ZGC, we turned our attention to integrating this promising new feature. Little did we know that we were about to stumble upon a unique challenge.

The Problem: Mysterious Timeouts and Hangs

Shortly after deploying services utilizing virtual threads, our engineering teams began reporting intermittent timeouts and unresponsive instances. These issues, initially appearing isolated, shared a common thread: they all affected applications running Java 21 with Spring Boot 3 and embedded Tomcat serving REST endpoints. The affected instances ceased responding to traffic, yet the JVM remained active. A key symptom consistently observed was a steady rise in the number of sockets in the closeWait (CLOSE_WAIT) state.

This indicated that the remote client had closed the connection, but the server-side socket remained open, suggesting the application failed to close it properly. This often points to an application hanging in an unusual state.
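
For readers less familiar with this TCP state, the following small, hypothetical sketch reproduces it in isolation: the client closes its end of the connection, but the server-side application never calls close(), so the accepted socket lingers in CLOSE_WAIT until the process releases it.

import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("localhost", server.getLocalPort());
            Socket accepted = server.accept();

            client.close(); // the peer sends FIN; the server side enters CLOSE_WAIT

            // The application never closes 'accepted', so the socket stays in
            // CLOSE_WAIT for as long as this process holds it open (visible with
            // a tool such as ss or netstat while the sleep below is running).
            Thread.sleep(60_000);
            accepted.close();
        }
    }
}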

Gathering Diagnostics: A Peek Under the Hood

Our initial troubleshooting involved leveraging our alerting system to capture an instance exhibiting the problematic behavior. We routinely collect and store thread dumps for all our JVM workloads, allowing us to reconstruct the sequence of events leading to the issue. However, the standard jstack-generated thread dumps revealed an unexpectedly idle JVM, offering no clues.

Remembering that virtual thread call stacks aren't captured by jstack, we utilized the jcmd Thread.dump_to_file command instead. This provided a more comprehensive view, including the state of virtual threads. As a final measure, we also collected a heap dump from the affected instance.
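
For completeness, the same data can also be captured programmatically; the sketch below assumes the HotSpotDiagnosticMXBean APIs available in JDK 21 and is only an illustration of what we collected, not our actual tooling. Note that both methods fail if the target file already exists.

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class DiagnosticsCapture {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostics =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Virtual-thread-aware thread dump, equivalent in content to
        // `jcmd <pid> Thread.dump_to_file`.
        diagnostics.dumpThreads("/tmp/threads.json",
            HotSpotDiagnosticMXBean.ThreadDumpFormat.JSON);

        // Heap dump of live objects for offline analysis (e.g. in Eclipse MAT).
        diagnostics.dumpHeap("/tmp/heap.hprof", true);
    }
}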

Analyzing the Evidence: Blank Virtual Threads

The jcmd-generated thread dumps revealed a surprising number of "blank" virtual threads:

#119821 "" virtual
#119820 "" virtual
#119823 "" virtual
#120847 "" virtual
#119822 "" virtual
...

These blank entries represent virtual threads that were created but never started, hence lacking a stack trace. Crucially, the number of these blank virtual threads closely mirrored the number of sockets stuck in the closeWait state.
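
These "blank" entries are easy to reproduce in isolation. The sketch below (illustrative, not taken from our dumps) shows that a virtual thread is unnamed by default and that an unstarted thread has no stack frames, which is exactly what such a dump entry looks like:

public class BlankVirtualThreadDemo {
    public static void main(String[] args) {
        // A virtual thread that has been created but never started:
        // no name, state NEW, and no stack frames.
        Thread unstarted = Thread.ofVirtual().unstarted(() -> {});

        System.out.println("name:   '" + unstarted.getName() + "'");       // '' (unnamed by default)
        System.out.println("state:  " + unstarted.getState());             // NEW
        System.out.println("frames: " + unstarted.getStackTrace().length); // 0
    }
}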

To understand this, we needed to delve into the mechanics of virtual threads. Unlike traditional threads, virtual threads aren't directly mapped to OS threads. They operate as tasks scheduled on a fork-join thread pool. When a virtual thread encounters a blocking operation, it releases the underlying OS thread (the "carrier thread") and waits in memory until it can resume. This allows a small number of OS threads to manage a large number of virtual threads.

In our setup, Tomcat uses a blocking model, holding a worker thread for the duration of a request. With virtual threads enabled, Tomcat switches to virtual execution, creating a new virtual thread for each incoming request, scheduled on a VirtualThreadExecutor. The observed symptoms suggested Tomcat was continuously creating virtual threads for new requests, but the underlying OS threads were unavailable to execute them.
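
As an illustration of how this wiring typically looks (a sketch of a common Spring Boot 3 setup, not our exact configuration), Tomcat's protocol handler can be pointed at a virtual-thread-per-task executor; on Spring Boot 3.2+ the spring.threads.virtual.enabled=true property achieves the same effect.

import java.util.concurrent.Executors;
import org.springframework.boot.web.embedded.tomcat.TomcatProtocolHandlerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VirtualThreadConfig {

    // Route Tomcat's request processing onto a virtual-thread-per-task executor,
    // so that each incoming request is handled on its own virtual thread.
    @Bean
    TomcatProtocolHandlerCustomizer<?> virtualThreadProtocolHandlerCustomizer() {
        return protocolHandler ->
            protocolHandler.setExecutor(Executors.newVirtualThreadPerTaskExecutor());
    }
}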

Unraveling the Mystery: Pinned Virtual Threads

The next question was: why were the OS threads unavailable? Virtual threads can become "pinned" to their carrier thread if they perform a blocking operation within a synchronized block or method. Our thread dumps revealed precisely this scenario. Several virtual threads were stuck in a park state within a synchronized block in the Brave tracing library, waiting to acquire a reentrant lock.

Since our instances had 4 vCPUs, the virtual thread fork-join pool also contained 4 OS threads. With all 4 threads pinned by these blocked virtual threads, no other virtual threads could be scheduled, effectively halting request processing in Tomcat. New requests continued to arrive, creating more virtual threads, but with no available OS threads, these new threads remained queued, holding onto their associated sockets, thus explaining the increasing number of sockets in closeWait.
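
Pinning itself is straightforward to demonstrate in isolation. In the hypothetical sketch below (not the Brave code path), the blocking call happens while a monitor is held, so the virtual thread cannot unmount and its carrier stays blocked; running with -Djdk.tracePinnedThreads=full on JDK 21 should print a stack trace each time this occurs.

public class PinningDemo {
    private static final Object MONITOR = new Object();

    public static void main(String[] args) throws InterruptedException {
        // Run with: java -Djdk.tracePinnedThreads=full PinningDemo
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (MONITOR) {               // monitor held by the virtual thread
                try {
                    Thread.sleep(500);             // blocking while holding the monitor -> pinned;
                } catch (InterruptedException e) { // the carrier OS thread stays blocked too
                    Thread.currentThread().interrupt();
                }
            }
        });
        vt.join();
    }
}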

Identifying the Lock Holder: A Heap Dump Deep Dive

Having pinpointed the lock contention, the next crucial step was to identify the lock holder. Thread dumps typically reveal this information, but our dumps lacked this crucial detail due to a limitation in Java 21. We turned to the heap dump and the Eclipse MAT tool to analyze the state of the lock.

We located the ReentrantLock object and meticulously examined its internal state. The exclusiveOwnerThread field was null, indicating no thread currently held the lock. The lock's internal wait queue revealed a complex scenario: a seemingly released lock with a waiting virtual thread poised to acquire it. This suggested a transient state between lock release and acquisition, yet our JVM remained frozen in this state.

The Deadlock: A Lock and a Limited Resource

Reconciling the thread dump and heap dump analysis, we realized the deadlock's nature. While one virtual thread had been notified to acquire the lock, it couldn't proceed due to the lack of available OS threads in the fork-join pool. The other virtual threads, pinned while waiting for the same lock, held onto their OS threads, creating a circular dependency. It was a deadlock, not involving two locks, but a lock and the limited resource of OS threads in the fork-join pool.
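
The failure mode can be approximated with a small, self-contained sketch (our simplified reconstruction, not the actual Brave code path). Forcing the scheduler's parallelism to 1 via the jdk.virtualThreadScheduler.parallelism system property makes a single pinned waiter enough to starve the lock owner of a carrier, and the program hangs:

import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class CarrierStarvationDemo {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static final Object MONITOR = new Object();

    // Run with: java -Djdk.virtualThreadScheduler.parallelism=1 CarrierStarvationDemo
    public static void main(String[] args) throws InterruptedException {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // Owner: acquires the lock, then blocks in sleep(). Sleeping unmounts it
            // and frees its carrier, but it will need a carrier again to release the lock.
            executor.submit(() -> {
                LOCK.lock();
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    LOCK.unlock();
                }
            });
            Thread.sleep(10); // let the owner acquire the lock first

            // Waiter: blocks on the same lock from inside a synchronized block, so it
            // is pinned and its carrier stays blocked. With only one carrier available,
            // the owner can never be rescheduled to release the lock: a deadlock between
            // a lock and the limited pool of OS threads.
            executor.submit(() -> {
                synchronized (MONITOR) {
                    LOCK.lock();
                    LOCK.unlock();
                }
            });
        } // close() waits for both tasks and therefore never returns
    }
}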

Conclusion: Lessons Learned and Looking Ahead

This investigation highlighted a subtle but critical interaction between virtual threads, synchronization, and limited resources. While virtual threads offer significant performance benefits, they introduce new complexities that require careful consideration. The specific issue we encountered was a consequence of a blocking operation within a synchronized block, combined with the limited size of the fork-join pool.

This experience underscores the importance of thorough testing and performance analysis when adopting new technologies. It also emphasized the need for improved tooling and diagnostics for virtual threads, a need that subsequent Java releases are addressing.

We remain enthusiastic about the potential of virtual threads and are continuing to integrate them into our systems. This investigation, while challenging, provided valuable insights into the intricacies of virtual threads and strengthened our approach to performance engineering at Netflix. We hope this detailed account of our investigative process proves beneficial to the wider Java community as they embark on their virtual thread journeys.
