dev-resources.site
for different kinds of informations.
Chromium Spelunking: A Stuck Task
The last installment was a bit of a detour from our task -- fetching a URL. But, it was a deliberate one, to learn about things I didn't understand well enough. Before that, I had gotten churl
up to the point where it crashed because its delegate was NULL. So, let's give it a delegate!
Delegates
Delegates are a common pattern in Chromium, forming a way for lower-level code to interact with higher-level code. It's a kind of dependency inversion. For example, an embedder like Chrome or Fuchsia provides a delegate (several, in fact) to the content layer, which that content layer uses to call back for embedder-related features like autofill.
In this case, we're providing a delegate to the URLRequest
, a subclass of net::URLRequest::Delegate
. Happily, most of the methods have default implementations, and in fact the only required method is OnReadCompleted
. My delegate overrides that method as well as OnConnected
and OnResponseReceived
, just to log when those events occur. In fact, I copied from TestDelegate
, including the buf_
stuff and the Read
calls.
So, at this point I have a delegate:
class ChurlDelegate : public net::URLRequest::Delegate {
public:
ChurlDelegate()
: buf_(base::MakeRefCounted<net::IOBuffer>(kBufferSize)) {}
~ChurlDelegate() override {}
int OnConnected(net::URLRequest* request,
const net::TransportInfo& info,
net::CompletionOnceCallback callback) override {
DLOG(ERROR) << "OnConnected";
return net::URLRequest::Delegate::OnConnected(request, info, std::move(callback));
}
void OnResponseStarted(net::URLRequest* request, int net_error) override {
DLOG(ERROR) << "OnResponseStarted";
if (net_error != net::OK) {
DLOG(ERROR) << "Error: " << net_error;
request->Cancel();
return;
}
// TODO: print response and headers?
int bytes_read = request->Read(buf_.get(), kBufferSize);
if (bytes_read >= 0) {
DLOG(ERROR) << "Read completed immediately";
OnReadCompleted(request, bytes_read);
} else if (bytes_read != net::ERR_IO_PENDING) {
DLOG(ERROR) << "Error from READ: " << bytes_read;
request->Cancel();
return;
}
}
void OnReadCompleted(net::URLRequest* request, int bytes_read) override {
DLOG(ERROR) << "OnReadCompleted with " << bytes_read << " bytes_read";
do {
DLOG(ERROR) << "GOT: " << buf_->data();
bytes_read = request->Read(buf_.get(), kBufferSize);
} while (bytes_read > 0);
}
private:
scoped_refptr<net::IOBuffer> buf_;
};
and the Churl class:
class Churl {
public:
void Fetch() {
net::URLRequestContextBuilder url_request_context_builder;
url_request_context_builder.set_user_agent("Dustin's Experiment");
url_request_context_builder.set_proxy_config_service(
std::make_unique<ProxyConfigServiceDirect>());
auto url_request_context = url_request_context_builder.Build();
auto url_request = url_request_context->CreateRequest(
GURL("http://httpbin.org/get"), // TODO: get this from command line
net::RequestPriority::HIGHEST,
&delegate_,
net::NetworkTrafficAnnotationTag(net::MutableNetworkTrafficAnnotationTag(TRAFFIC_ANNOTATION_FOR_TESTS)),
false);
DLOG(ERROR) << "calling start";
url_request->Start();
DLOG(ERROR) << "started";
}
private:
std::unique_ptr<net::URLRequestContext> url_request_context_;
ChurlDelegate delegate_;
};
and all of this created in main
with
Churl churl;
base::ThreadPool::CreateSingleThreadTaskRunner({})->PostTask(
FROM_HERE, base::BindOnce(&Churl::Fetch, base::Unretained(&churl)));
Note that the delegate is passed to CreateRequest
as a raw pointer. It's up to me to make sure that the delegate continues to exist for the duration of that request, and it will crash otherwise. Ask me how I know!
I put it into an instance variable in Churl
, since that Churl
instance lives until the end of main
. The receiver for the Fetch
method needs to be included in the task callback, and that's done with base::Unretained
to indicate that the callback does not "own" the pointer.
An Easy Fix
So, the first run fails with this:
[0525/191840.310712:FATAL:command_line.cc(247)] Check failed: current_process_commandline_.
A little exploring of other "utility" tools like root_store_tool
suggests
base::CommandLine::Init(argc, argv);
And indeed, this gets things running -- sort of.
[0530/190907.523753:ERROR:churl_bin.cc(190)] URLRequestContext created: 0x7f4874003780
[0530/190907.524119:ERROR:churl_bin.cc(198)] calling start
[0530/190907.526291:ERROR:churl_bin.cc(200)] started
That's it -- no logging from OnConnected
. Maybe not so easy?
Finding a Stuck Task
I am sort of old-fashioned, and I'd rather debug things by running them and watching the output, than using fancy tools like lldb
. My usual tactic is to add lots of debug logging (in Chromium, DLOG(ERROR) << "got here 1"
) to narrow down where a problem occurs. In this case, Start()
started things off, but got "stuck" somewhere.
Looking at URLRequest::Start
, it creates a new job with a job factory, and I can verify that it does, indeed call StartJob(..)
with this new job by adding a debug print after that line.
That job is a subclass of URLRequestJob
, but which subclass? Codesearch gives a list of subclasses, and URLRequestHttpJob
seems a likely suspect. Adding some debug prints there confirms this guess. URLRequestJob
then calls job->Start()
.
URLRequestHttpJob
does its work in a series of callback tasks. Each possible bit of work is a separate method, and any conditionals or loops are accomplished "manually" by conditionally calling or posting a task for a different method. It's almost like assembly language, tracing out all of the conditional jumps and control-flow cycles. In this case, adding debug prints to the beginning of each method quickly shows that StartTransactionInternal
is the last method called.
HttpRequestJob::Start
The StartTransactionInternal
method seems to set up a transaction and call its Start
method with a callback bound to URLRequestHttpJob::OnStartCompleted
. If the Start
method finishes immediately, it calls OnStartCompleted
directly. This answers my question from the first post: when a function takes a callback, but does not return ERR_IO_PENDING
, it also does not invoke the callback and expects its caller to do so. This seems like it would be a source of bugs, if one of those two control-flow paths is rarely taken.
Anyway, OnStartCompleted
is never called, and one more debug print shows that Start
is returning ERR_IO_PENDING
so Start
should be calling it.
I briefly wondered if this was another problem with threads and tasks, where the process was exiting before the task had a chance to run. Maybe FlushForTesting
sees no tasks running while some kind of IO is pending, and returns? I added a sleep in the main task (base::PlatformThread::Sleep(base::Seconds(5))
) and OnStartCompleted
still did not run, so it's nothing quite that simple.
So let's dive one layer deeper. URLRequestHttpJob
is using a factory to make an instance of an interface, in this case an HttpTransactionFactory
creating an HttpTransaction
.There's no easy way to print the class name of an object, but codesearch can (usually) list the subclasses of an interface, so a bit of guess-and-check works. My first guess was an HttpNetworkTransaction
, but a debug print in its Start
method did not appear. Then I remembered that the HttpCache
is a "read-through" cache, and a debug print confirms that the HttpTransaction
is an HttpCache::Transaction
.
HttpTransaction::Start
The HttpCache::Transaction
class uses the DoLoop pattern. I used a quick vim macro to print the states as they occurred:
[0525/195452.887052:ERROR:http_cache_transaction.cc(861)] STATE_GET_BACKEND
[0525/195452.887096:ERROR:http_cache_transaction.cc(866)] STATE_GET_BACKEND_COMPLETE
[0525/195452.887189:ERROR:http_cache_transaction.cc(870)] STATE_INIT_ENTRY
[0525/195452.887230:ERROR:http_cache_transaction.cc(875)] STATE_OPEN_OR_CREATE_ENTRY
[0525/195452.887406:ERROR:http_cache_transaction.cc(880)] STATE_OPEN_OR_CREATE_ENTRY_COMPLETE
[0525/195452.887450:ERROR:http_cache_transaction.cc(902)] STATE_ADD_TO_ENTRY
So, it's stuck after DoAddToEntry
. More debug prints there show that it's returning the rv
from cache_->AddTransactionToEntry
. cache_
is an HttpCache
, so off we go!
HttpCache::AddTransactionToEntry
// Adds a transaction to an ActiveEntry. This method returns ERR_IO_PENDING
// and the transaction will be notified about completion via its IO callback.
// In a failure case, the callback will be invoked with ERR_CACHE_RACE.
int AddTransactionToEntry(ActiveEntry* entry, Transaction* transaction);
So, what's the transaction's "IO callback" in this case? It has an io_callback_
property, and elsewhere in http_cache.cc
I see transaction->io_callback().Run(OK)
so at a guess that's what it's calling. But, what is that set to at the time? The HttpCache::Transaction
constructor sets this to a binding for the OnIOComplete
method, and it seems it's only set to other things in order to support testing. And that just calls DoLoop
. And that would log. So, what's going on in the cache that causes it not to call the IO callback?
Adding some more debug prints to HttpCache
, it seems that ProcessQueuedTransactions
is executing, and posting a task, but HttpCache::OnProcessQueuedTransactions
never executes.
I got stuck here for a bit, and decided I'd put in enough work to ask for help. Pairing with a colleague, we pored over the code, looking for what I might have missed. Nothing jumped out as obviously wrong.
As an experiment, we tried posting another task that just contained a lambda expression logging a message. That task ran! So, this isn't an issue of the thread pool not running tasks.
base::SingleThreadTaskRunner::GetCurrentDefault()->PostTask(
FROM_HERE,
base::BindOnce(&HttpCache::OnProcessQueuedTransactions, GetWeakPtr(),
base::UnsafeDanglingUntriaged(entry)));
The GetWeakPtr()
in the PostTask
invocation is curious, though -- what happens if that pointer becomes invalid before the task runs? Replacing it with base::Unretained(this)
"fixes" the issue: the callback gets called. So it seems like the task machinery detects that a weak pointer has been invalidated and drops the task depending on it -- a pretty smart way to avoid use-after-free errors! That means that the HttpCache
is getting destroyed before this callback can begin.
HttpCache
Lifetime
So, let's see if we can figure out what's creating that and how long it lasts. It looks like HttpCache
is a subclass of HttpTransactionFactory
, and the URLRequestContext
references it as http_transaction_factory_
. That's set by URLRequestContextBuilder
.
Ah, the URLRequestContext
is in a local variable so it is being destroyed while the URLRequest
is still in progress. So that's no good. Rearranging things to keep the context as an instance variable of the Churl
class fixes the issue.
Which is not to say it's working, but that's a job for the next post in the series:
[0605/175824.724186:ERROR:churl_bin.cc(190)] URLRequestContext created: 0x7fb224002df0
[0605/175824.725417:ERROR:churl_bin.cc(198)] calling start
[0605/175824.730763:ERROR:churl_bin.cc(200)] started
[0605/175824.734067:FATAL:current_thread.cc(197)] Check failed: sequence_manager.
#0 0x7fb23107ca8c base::debug::CollectStackTrace()
#1 0x7fb2310332da base::debug::StackTrace::StackTrace()
#2 0x7fb231033295 base::debug::StackTrace::StackTrace()
#3 0x7fb230d575f9 logging::LogMessage::~LogMessage()
#4 0x7fb230d02bac logging::(anonymous namespace)::DCheckLogMessage::~DCheckLogMessage()
#5 0x7fb230d02bd9 logging::(anonymous namespace)::DCheckLogMessage::~DCheckLogMessage()
#6 0x7fb230d028bd logging::CheckError::~CheckError()
#7 0x7fb230ed6cc2 base::CurrentIOThread::Get()
#8 0x7fb232194f2d net::SocketPosix::Connect()
#9 0x7fb232199906 net::TCPSocketPosix::Connect()
[update: it turns out it's also necessary to keep the URLRequest
in an instance variable]
Featured ones: