header-generator fails luminati botcheck #152

Open
corford opened this issue Mar 21, 2023 · 5 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@corford

corford commented Mar 21, 2023

Describe the bug

Multiple header and TLS tests fail when visiting: https://botcheck.luminati.io/

To Reproduce

  • got-scraping 3.2.13
  • header-generator 2.1.17
  • crawlee 3.3.0
  • Ubuntu focal

Headers present on request (when emulating Chrome 110 Windows):

Host: headers.cf
Connection: close
Content-Length: 0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Sec-Fetch-Site: same-site
Sec-Fetch-User: ?1
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Sec-Ch-Ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"

Response from botcheck:

Type navigate
PASS User agent
FAIL Header values: sent headers do not match what is expected
  sec-fetch-site
    + same-site
    - none
PASS Header case
FAIL Header order: header order is incorrect
  :path
    + 1
    - 3
  :authority
    + 2
    - 1
  :scheme
    + 3
    - 2
PASS HTTP version
PASS TLS version
FAIL TLS cipher
  + 130113021303c02bc02fc02cc030cca9cca8c013c014009c009d002f003500ff
FAIL Http2 settings
  headerTableSize
    + 4096
    - 65536
  initialWindowSize
    + 33554432
    - 6291456
  maxConcurrentStreams
    + 4294967295
    - 1000
  maxHeaderListSize
    + 4294967295
    - 262144
  maxHeaderSize
    + 4294967295
    - 262144

Headers present on request (when emulating Firefox 110 Windows):

Host: headers.cf
Connection: close
Content-Length: 0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Upgrade-Insecure-Requests: 1
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Sec-Fetch-Site: same-site
Sec-Fetch-User: ?1

Response from botcheck:

Type navigate
PASS User agent
PASS Header values
WARN Tricky headers: using these headers incorrectly may impact success rate
  - sec-fetch-site
  - sec-fetch-mode
  - sec-fetch-user
  - sec-fetch-dest
PASS Header case
PASS Header order
PASS HTTP version
PASS TLS version
FAIL TLS cipher
  + 130113031302c02bc02fcca9cca8c02cc030c00ac009c013c014009c009d002f003500ff
FAIL Http2 settings
  headerTableSize
    + 4096
    - 65536
  enablePush
    + false
    - true
  initialWindowSize
    + 33554432
    - 131072

Expected behaviour

All tests should pass (which is the case if you visit https://botcheck.luminati.io/ using a real Chrome 110 or Firefox 110 browser on Windows).

System information:

  • OS: Ubuntu focal
  • Node.js version: v18.11.0


@corford corford added the bug Something isn't working. label Mar 21, 2023
@corford corford changed the title from "header-genertator fails luminati botcheck" to "header-generator fails luminati botcheck" Mar 21, 2023
@mnmkng
Member

mnmkng commented Mar 22, 2023

@Equidem could you please take a look at the header orders and incorrect headers present together? That points to some issue in the generation code or the way we process the fingerprints we collect. Since we work with correct fingerprints, we should not get incorrect results.

The TLS is a bit tricky because Node.js does not allow the same level of configuration, but we can try to look again at this as well.
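The HTTP/2 settings failure in the report is separate from the TLS one. A rough sketch of the comparison botcheck appears to make for the Chrome emulation, using only numbers copied from the report; reading `+` as "sent" and `-` as "expected" is an assumption based on the sec-fetch-site entry, and `diffSettings` is a hypothetical helper:

```javascript
// Values from the Chrome 110 botcheck response in this issue.
// Assumption: "+" lines are what the client sent, "-" lines what was expected.
const sentByEmulation = {
  headerTableSize: 4096,
  initialWindowSize: 33554432,
  maxConcurrentStreams: 4294967295,
  maxHeaderListSize: 4294967295,
};

const expectedForChrome = {
  headerTableSize: 65536,
  initialWindowSize: 6291456,
  maxConcurrentStreams: 1000,
  maxHeaderListSize: 262144,
};

// Return every setting whose sent value differs from the expected one.
function diffSettings(sent, expected) {
  const diff = {};
  for (const key of Object.keys(expected)) {
    if (sent[key] !== expected[key]) {
      diff[key] = { sent: sent[key], expected: expected[key] };
    }
  }
  return diff;
}

console.log(diffSettings(sentByEmulation, expectedForChrome));
```

The keys used here are the same ones Node.js accepts in the `settings` option of `http2.connect()`, so an object shaped like `expectedForChrome` could in principle be passed there. Whether got-scraping exposes that option is not something this sketch assumes.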

@barjin
Collaborator

barjin commented Mar 22, 2023

I guess @Equidem won't recognize his own code; I have basically rewritten the whole thing since it was forged in the ancient flames 😄

This is likely related to apify/got-scraping#65 (which mentions the same problems) and will be partially solved by #149, which introduces an automatic way of updating the header orders. As far as I can see, the only incorrect header is sec-fetch-site, which does not identify the user; it says in what context the request was made (think CORS: same-site, cross-site, ...). Since got-scraping cannot execute client-side JS, the only valid value here is none (user-initiated request). This is not a problem with the collected data, but with our methodology, which is easy to fix in got-scraping.
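Until that is fixed upstream, a possible workaround is to rewrite the header on the caller's side. A minimal sketch, not the actual got-scraping fix: `forceSecFetchSiteNone` is a hypothetical helper that takes a plain headers object (such as the generated headers listed in the report) and returns a copy with sec-fetch-site set to none, keeping the original header order and casing:

```javascript
// Return a copy of `headers` with any sec-fetch-site header set to "none".
// Header names are matched case-insensitively; order and casing are preserved.
// If the header is absent, nothing is added.
function forceSecFetchSiteNone(headers) {
  const result = {};
  for (const [name, value] of Object.entries(headers)) {
    result[name] = name.toLowerCase() === "sec-fetch-site" ? "none" : value;
  }
  return result;
}

// Example using a subset of the Chrome headers from the report.
const generated = {
  "Sec-Fetch-Mode": "navigate",
  "Sec-Fetch-Dest": "document",
  "Sec-Fetch-Site": "same-site",
  "Sec-Fetch-User": "?1",
};

const fixed = forceSecFetchSiteNone(generated);
console.log(fixed);
```

The result could then be supplied as the `headers` option of a request. That explicitly supplied headers take precedence over generated ones is an assumption to verify against the got-scraping version in use.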

@barjin barjin self-assigned this Jul 21, 2023
@barjin barjin added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 21, 2023
@yovanoc

yovanoc commented Sep 21, 2023

Even Arc or Opera don't pass this check.

@tenkuken

I found some anti-bot detection based on HTTP header order:
https://my.f5.com/manage/s/article/K13527565

@Suniron

Suniron commented Oct 21, 2023

Edge Chromium doesn't pass the check either. 🙄
