Articles like the ones by this publisher require the user to be logged in to read. This translates to the API as requiring a Cookie
header that includes the sid and uid of a logged-in user. The header looks like:
Cookie: sid=...; uid=...;
The values can be retrieved by logging into Medium and looking at cookies in the web inspector.
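As a sketch, building that header programmatically might look like this (the sid/uid values are placeholders you'd copy from the web inspector; everything beyond the header shape described above is an assumption):

```python
# Sketch: building the Cookie header described above.
# The sid and uid values are placeholders copied from the browser's
# web inspector after logging into Medium.
def medium_cookie_header(sid: str, uid: str) -> dict:
    """Return request headers carrying Medium's session cookies."""
    return {"Cookie": f"sid={sid}; uid={uid};"}

# These headers would then be sent along with the article request,
# e.g. urllib.request.Request(url, headers=medium_cookie_header(sid, uid)).
```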
While this does work, it adds back the 3-article limit for premium articles; articles fetched after that are truncated. One way around this might be to use a premium user's cookie, but I'm not about to ask for one or encourage anyone to give money to Medium.
It looks like Medium is now truncating articles that are "Medium members only". They used to have a free 2-article limit, but it seems like that's gone. Here's an example: The Secrets of Retirement No One Tells You.
I'm still not sure what to do about this unless a whole bunch of people want to donate a paid account, and giving money to Medium is against the goals of the project.
I’m playing with this a bit:
The Google web cache has full access to member-only stories.
This works well, with the small issue that some media is not loaded.
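For illustration, the cache lookup could be sketched like this (the webcache.googleusercontent.com URL shape is my assumption of the format, not something confirmed in this thread):

```python
from urllib.parse import quote

# Sketch: Google's web cache was reachable at a URL of this shape.
# The exact format is an assumption and may change (or disappear) at
# Google's whim.
def google_cache_url(article_url: str) -> str:
    """Build a Google web cache URL for the given article."""
    return (
        "https://webcache.googleusercontent.com/search?q=cache:"
        + quote(article_url, safe="")
    )
```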
To load media, a query is made to https://medium.com/_/graphql. Nobody would like it if I wrote out the full query, but the relevant bit is:
    fullContent(postMeteringOptions: $postMeteringOptions) {
      isLockedPreviewOnly
      validatedShareKey
      bodyModel {
        ...PostBody_bodyModel
        __typename
      }
    }
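As a sketch, the POST body for that endpoint might be assembled like this. Only the fullContent selection comes from the thread; the operation name, the wrapping post(id:) field, and the variable contents are hypothetical:

```python
import json

# Sketch of a GraphQL POST body for https://medium.com/_/graphql.
# Only the fullContent selection is taken from the thread; the
# surrounding operation name and variables are hypothetical, and the
# PostBody_bodyModel fragment definition is deliberately elided.
RELEVANT_FRAGMENT = """
fullContent(postMeteringOptions: $postMeteringOptions) {
  isLockedPreviewOnly
  validatedShareKey
  bodyModel {
    ...PostBody_bodyModel
    __typename
  }
}
""".strip()

def build_payload(post_id: str) -> str:
    """Return a JSON request body embedding the fullContent selection."""
    query = (
        "query PostContent($postId: ID!, "
        "$postMeteringOptions: PostMeteringOptions) "
        "{ post(id: $postId) { %s } }" % RELEVANT_FRAGMENT
    )
    return json.dumps({
        "query": query,
        "variables": {"postId": post_id, "postMeteringOptions": {}},
    })
```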
I have not yet looked into
- If the media response is limited
- What toggling isLockedPreviewOnly does
But anyway, you get back something like:
    {
      "id": "327d9b7b7434_17",
      "name": "9081",
      "iframe": {
        "mediaResource": {
          "id": "f1260d3bac688acd376fdf733d116148"
        },
        "__typename": "Iframe"
      }
    }
What you get back depends on the media type. In the case of an iframe, you then need to make another query based on that id; in this case it’s https://medium.com/media/f1260d3bac688acd376fdf733d116148
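Putting those two steps together, a sketch of pulling the iframe's mediaResource id out of a paragraph like the one above and building the follow-up media URL (the paragraph shape mirrors the example response; anything beyond that is an assumption):

```python
# Sketch: extract the iframe mediaResource id from a paragraph shaped
# like the example above and build the follow-up media URL.
def media_url(paragraph):
    """Return the media URL for an iframe paragraph, or None otherwise."""
    iframe = paragraph.get("iframe")
    if not iframe or iframe.get("__typename") != "Iframe":
        return None  # other media types would need different handling
    return "https://medium.com/media/" + iframe["mediaResource"]["id"]

# The example paragraph from the response above:
paragraph = {
    "id": "327d9b7b7434_17",
    "name": "9081",
    "iframe": {
        "mediaResource": {"id": "f1260d3bac688acd376fdf733d116148"},
        "__typename": "Iframe",
    },
}
```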
And then you have your media, and then I basically gave up and then decided to just find another article that wasn’t behind the paywall :)
Thank you for looking into this ~boehs.
> And then you have your media, and then I basically gave up and then decided to just find another article that wasn’t behind the paywall :)
This has basically been my conclusion also. I'm not feeling super motivated to scrape Google's cache.
It doesn't help that I view Scribe as a last resort. Basically, don't read Medium articles, but if you have to, use Scribe.
The only two things I can think of are to:
- change the user-agent to a web crawler's (e.g. Googlebot, or Bing's crawler user agent)
- disable JavaScript when fetching content
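The first idea could be sketched like this (the exact Googlebot UA string is my assumption, and as the reply below notes, the approach didn't pan out in practice):

```python
# Sketch: pretending to be a crawler by swapping the User-Agent header.
# The UA string below is an assumed Googlebot string; as noted in the
# reply below, this approach actually got requests blocked in practice.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

def crawler_headers() -> dict:
    """Return request headers imitating a web crawler."""
    return {"User-Agent": GOOGLEBOT_UA}
```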
~matthj thanks for the ideas.
This may be a misunderstanding of how Scribe gets its data. It uses an undocumented GraphQL API that doesn't expect any particular user-agent string or require JS to be enabled. I tried the user-agent idea anyway, and it actually got my requests blocked 🙃
If you have any other ideas, let me know.