GraphQL Investigation, Haaretz Newspaper (Israel)

Preface

This article details a quick investigation into a public GraphQL interface of a major Israeli daily newspaper — Haaretz (English edition).

The short piece is structured around the flow of an investigative session, with various command line and interactive tools being mentioned and used. The avid reader can then use this structure as a basis for investigating other GraphQL endpoints.

Beyond that, there might be some value in the exposure of this specific endpoint, for Data Journalists and other interested parties, assuming that it will not be restricted in the near future.

As I’m a subscriber to said newspaper, I have a valid (user) ID, which was used during this investigation. I made some effort to remove references to this ID from the article. If you do find it, I ask that you refrain from using it. To reproduce this session, you would need a valid ID with haaretz.co.il (free to register, paid for extensive use).

No effort has been spent on explaining the tools, detailing their installation or providing advice as to why they were used. If you keep up with the pace, I assume you will know how to install and use them effectively.

Enjoy the ride.

Initial Contact

While on a random DNS sniff, I encountered a GraphQL hostname for my daily newspaper. As it is Corona time, and we are essentially grounded, I thought it might be fun to investigate this further. So here we go:

Some GraphQL server implementations provide an interactive playground. It seems like it’s open on this one:

This type of playground also provides access to schema and documentation. We’ll save these soon, but first let’s look through the Docs for a version query and execute it:

Nice. So we have a working GraphQL interface here!

Before it goes away, or we get blocked for some reason, let’s save the schema using graphqurl, a command line tool for executing GraphQL requests (the playground also allows saving the schema in JSON format; do that if you prefer):
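If you would rather script this step, here is a minimal Python sketch of the same idea using only the standard library. It is not what I ran during the session. The ENDPOINT value is a hypothetical placeholder (the real hostname is not spelled out in this article), the function names are my own, and the introspection query is a deliberately trimmed one that only asks for type and field names.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the GraphQL hostname you are exploring.
ENDPOINT = "https://graphql.example.invalid/"

# A trimmed introspection query: type names, kinds and field names only.
INTROSPECTION_QUERY = """
query {
  __schema {
    queryType { name }
    types { name kind fields { name } }
  }
}
"""


def build_request(query, variables=None, headers=None):
    """Build (but do not send) a GraphQL POST request."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


def save_schema(path="schema.json"):
    """Send the introspection query and write the raw JSON response to disk."""
    request = build_request(INTROSPECTION_QUERY)
    with urllib.request.urlopen(request, timeout=10) as response:
        schema = json.load(response)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(schema, handle, indent=2)
    return schema
```

Calling save_schema() performs the network request; build_request() on its own is safe to experiment with offline.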

We can also execute the same version query through the command line interface:
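When scripting such calls, keep in mind that a GraphQL server normally answers with a JSON object holding a "data" key and, when something went wrong, an "errors" list. The sketch below shows one way to unwrap such a response. The field name in VERSION_QUERY and the sample response are assumptions for illustration, not captured output.

```python
# Assumed field name; check the playground Docs for the exact spelling.
VERSION_QUERY = "{ version }"


def unwrap(response):
    """Return the 'data' part of a GraphQL response, raising if errors are present."""
    errors = response.get("errors")
    if errors:
        messages = "; ".join(str(e.get("message", "unknown error")) for e in errors)
        raise RuntimeError(f"GraphQL errors: {messages}")
    return response.get("data")


# Made-up sample in the standard response shape, for illustration only.
sample_ok = {"data": {"version": "x.y.z"}}
print(unwrap(sample_ok))
```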

Next we’ll try and use the interface for something more meaningful.

What Are We Looking At?

The newspaper’s main site was open in my browser at the time of the initial capture of the DNS query. This is what it looks like right now (lately it is trying to sell me hats …):

My assumption is that the GraphQL endpoint is used by this page. Let’s fire up Wireshark and look for traffic:

There is definitely traffic there, but it is SSL-encrypted. We’ll need another tool to examine the data flow; something located closer to the requesting endpoint (i.e. our browser).

Let’s fire up the Developer Tools on the Chrome browser and see if we find a request to this host. Indeed there are requests being issued from time to time:

Opening one of them reveals some data being sent and a JSON response being received. Let’s try and execute it using curl:

Plugging the id into a GraphQL query, as found in the schema, reveals the following personal data (name, phone number).

[note that this request has no further context (auth-token, etc). It might be interesting to try it on another id — one that is not currently of the logged-in user — but we are not hacking the system here, just exploring, so no need to fish further]

Data Journalism Useful? Scraping? etc.

Can we use this interface to get interesting content? Let us scan the schema for article/news related stuff.
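Scanning by eye works, but a saved introspection result can also be searched programmatically. The following is a small sketch, assuming the schema was saved as JSON in the standard introspection shape (data, then __schema, then types). The SAMPLE below is made up for illustration; only the breakingNewsBox, version and contentId names come from this session, and the Article type is hypothetical.

```python
def find_fields(schema, keywords):
    """Return sorted 'Type.field' names whose field name contains any keyword."""
    hits = []
    for gql_type in schema["data"]["__schema"]["types"]:
        # 'fields' is null for scalars, enums and similar kinds.
        for field in gql_type.get("fields") or []:
            name = field["name"]
            if any(keyword.lower() in name.lower() for keyword in keywords):
                hits.append(f"{gql_type['name']}.{name}")
    return sorted(hits)


# Tiny made-up sample in the introspection shape, not the real schema.
SAMPLE = {"data": {"__schema": {"types": [
    {"name": "Query", "kind": "OBJECT",
     "fields": [{"name": "version"}, {"name": "breakingNewsBox"}]},
    {"name": "Article", "kind": "OBJECT",
     "fields": [{"name": "title"}, {"name": "contentId"}]},
    {"name": "String", "kind": "SCALAR", "fields": None},
]}}}

print(find_fields(SAMPLE, ["news", "article"]))
```

With a real schema file, load it with json.load and pass the result in place of SAMPLE.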

The query named “breakingNewsBox(…)” seems to be interesting. Note the additional HTTP headers added to the request. None of them seem to be session sensitive.

Let’s transfer this to graphqurl and add some filtering using jq to extract just the title text (and a bit of awk for good measure).
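For those who prefer Python over jq and awk, the title extraction can be sketched as a recursive walk over the JSON response. To avoid guessing the exact response layout, the helper below collects every value stored under a given key wherever it appears. The "title" key and the sample response are assumptions for illustration, not the endpoint's confirmed shape.

```python
def collect_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: matching keys may be nested deeper.
            yield from collect_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from collect_key(item, key)


# Made-up response for illustration; the real field names may differ.
SAMPLE_RESPONSE = {"data": {"breakingNewsBox": {"items": [
    {"title": "First headline", "contentId": "placeholder-1"},
    {"title": "  Second headline  ", "contentId": "placeholder-2"},
]}}}

titles = [t.strip() for t in collect_key(SAMPLE_RESPONSE, "title") if isinstance(t, str)]
for title in titles:
    print(title)
```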

Definitely some potential there!

Want to Dive Deeper?

I will leave you with two ideas for further exploration of this interface.

1. Explore an item from the breaking news, follow its contentId, obtain its contents, etc.:

2. Explore queries in an interactive manner with the specified headers (using graphqurl’s -i flag).

I hope this has been a joyful and insightful ride. I have certainly enjoyed the few hours spent investigating and writing this article on this Saturday morning.
