~edwardloveall/Scribe#11: 
Some paywalled articles require a login cookie

Articles like the ones by this publisher require the user to be logged in to read. On the API side, this translates to a Cookie header that includes the sid and uid of a logged-in user. The header looks like:

Cookie: sid=...; uid=...;

The values can be retrieved by logging into Medium and looking at cookies in the web inspector.
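As a rough sketch, attaching that cookie to a request might look like the following. The sid/uid values here are placeholders; real ones come from a logged-in Medium session via the web inspector, and the article URL is just an example.

```python
# Sketch: building a request with the login cookie described above.
# SID/UID values are placeholders, not real credentials.
from urllib.request import Request

def build_request(url: str, sid: str, uid: str) -> Request:
    """Attach the Cookie header Medium expects for logged-in users."""
    req = Request(url)
    req.add_header("Cookie", f"sid={sid}; uid={uid};")
    return req

req = build_request("https://medium.com/some/article", "SID_VALUE", "UID_VALUE")
print(req.get_header("Cookie"))  # sid=SID_VALUE; uid=UID_VALUE;
```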

While this does work, it reintroduces the 3-article limit for premium articles; articles fetched after that are truncated. One way to get around this might be to use a premium user's cookie, but I'm not about to ask or encourage anyone to give money to Medium.

Status
REPORTED
Submitter
~edwardloveall
Assigned to
No-one
Submitted
2 years ago
Updated
4 months ago
Labels
No labels applied.

~edwardloveall closed duplicate ticket #30 1 year, 1 month ago

Florian Kohrt closed duplicate ticket #31 1 year, 7 days ago

~edwardloveall 11 months ago

It looks like Medium is now truncating articles that are "Medium members only". They used to have a free 2-article limit, but it seems like that's gone. Here's an example: The Secrets of Retirement No One Tells You.

I'm still not sure what to do about this unless a whole bunch of people want to donate a paid account, and giving money to Medium is against the goals of the project.

~boehs 10 months ago

I’m playing with this, but:

  1. The google web cache has full access to member only stories

  2. This works well, with the small issue that some media is not loaded

  3. To load media, a query is made to https://medium.com/_/graphql. Nobody would like it if I wrote out the full query, but the relevant bit is

     fullContent(postMeteringOptions: $postMeteringOptions) {
       isLockedPreviewOnly
       validatedShareKey
       bodyModel {
         ...PostBody_bodyModel
         __typename
       }
     }

I have not yet looked into

  1. If the media response is limited
  2. What toggling isLockedPreviewOnly does
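A rough sketch of how the fullContent fragment above might be POSTed to that endpoint. Only the fragment itself comes from observed traffic; the operation name, the wrapping `post(id:)` field, and the variable shapes are assumptions for illustration.

```python
# Sketch: POSTing the fullContent query to Medium's undocumented
# GraphQL endpoint. The query wrapper and variables are assumed.
import json
from urllib.request import Request

GRAPHQL_URL = "https://medium.com/_/graphql"

# Hypothetical wrapper around the fragment shown above.
QUERY = """
query PostContent($postId: ID!, $postMeteringOptions: PostMeteringOptions) {
  post(id: $postId) {
    fullContent(postMeteringOptions: $postMeteringOptions) {
      isLockedPreviewOnly
      validatedShareKey
      bodyModel {
        ...PostBody_bodyModel
        __typename
      }
    }
  }
}
"""

def build_graphql_request(post_id: str) -> Request:
    """Build (but do not send) the GraphQL POST request."""
    payload = json.dumps({
        "query": QUERY,
        "variables": {"postId": post_id, "postMeteringOptions": None},
    }).encode()
    req = Request(GRAPHQL_URL, data=payload, method="POST")
    req.add_header("Content-Type", "application/json")
    return req

req = build_graphql_request("1234567890ab")
print(req.get_method())  # POST
```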

But anyway you get back something like

     {
         "id": "327d9b7b7434_17",
         "name": "9081",
         "iframe": {
             "mediaResource": {
                 "id": "f1260d3bac688acd376fdf733d116148"
             },
             "__typename": "Iframe"
         }
     }
Depending on the media type, you may then need another query. In the case of an iframe, you make a follow-up request based on that id, in this case to https://medium.com/media/f1260d3bac688acd376fdf733d116148

And then you have your media, and then I basically gave up and then decided to just find another article that wasn’t behind the paywall :)

~edwardloveall 10 months ago

Thank you for looking into this ~boehs.

And then you have your media, and then I basically gave up and then decided to just find another article that wasn’t behind the paywall :)

This has basically been my conclusion also. I'm not feeling super motivated to scrape google's cache.

It doesn't help that I view Scribe as a last resort. Basically, don't read Medium articles, but if you have to, use Scribe.

~edwardloveall closed duplicate ticket #32 7 months ago

~matthj 4 months ago*

The only two things I can think of are to:

- change the user agent to a web crawler's (e.g. Googlebot, or Bing's crawler user agent)

- disable JavaScript when fetching content

~edwardloveall 4 months ago

~matthj thanks for the ideas.

This may be a misunderstanding of how Scribe gets its data. It uses an undocumented GraphQL API that doesn't expect any particular user agent string and doesn't involve JavaScript at all. I tried the user agent idea anyway and it actually blocked my requests 🙃

If you have any other ideas, let me know.
