Commit Graph

41 Commits

Author SHA1 Message Date
a34075d205 SECURITY: Onebox response timeout and size limit (#15927)
Validation to ensure that Onebox request is no longer than 10 seconds and response size is not bigger than 1 MB
2022-02-14 12:11:09 +11:00
53abcd825d FIX: Canonical URLs may be relative (#14825)
FinalDestination's follow_canonical mode used for embedded topics should work when canonical URLs are relative, as specified in [RFC 6596](https://datatracker.ietf.org/doc/html/rfc6596)
2021-11-05 14:20:14 -03:00
4c2d5158c5 FIX: Follow the canonical URL when importing a remote topic. (#14489)
FinalDestination now supports the `follow_canonical` option, which will perform an initial GET request, parse the canonical link if present, and perform a HEAD request to it.

We use this mode during embeds to avoid treating URLs with different query parameters as different topics.
2021-10-01 12:48:21 -03:00
2f28ba318c FEATURE: Onebox can match engines based on the content_type (#13876)
* FEATURE: Onebox can match engines based on the content_type

`FinalDestination` now returns the `content_type` of a resolved URL.

`Oneboxer` passes this value to `Onebox` itself. Onebox engines can now specify a `matches_content_type` regex of content_types that the engine can handle, regardless of the URL.

`ImageOnebox` will match URLs with a content type of `image/png`, `jpg`, `gif`, `bmp`, `tif`, etc.

This will allow images that exist at a URL without a file type extension to be correctly rendered, assuming a valid `content_type` is returned.
2021-07-30 13:36:30 -04:00
19182b1386 DEV: Oneboxer wildcard subdomains (#13015)
* DEV: Allow wildcards in Oneboxer optional domain Site Settings

Allows a wildcard to be used as a subdomain on Oneboxer-related SiteSettings, e.g.:

- `force_get_hosts`
- `cache_onebox_response_body_domains`
- `force_custom_user_agent_hosts`

* DEV: fix typos

* FIX: Try doing a GET after receiving a 500 error from a HEAD

By default we try to do a `HEAD` requests. If this results in a 500 error response, we should try to do a `GET`

* DEV: `force_get_hosts` should be a hidden setting

* DEV: Oneboxer Strategies

Have an alternative oneboxing ‘strategy’ (i.e., set of options) to use when an attempt to generate a Onebox fails. Keep track of any non-default strategies that were used on a particular host, and use that strategy for that host in the future.

Initially, the alternate strategy (`force_get_and_ua`) forces the FinalDestination step of Oneboxing to do a `GET` rather than `HEAD`, and forces a custom user agent.

* DEV: change stubbed return code

The stubbed status code needs to be a value not recognized by FinalDestination
2021-05-13 15:48:35 -04:00
da9b837da0 DEV: More robust processing of URLs (#11361)
* DEV: More robust processing of URLs

The previous `UrlHelper.encode_component(CGI.unescapeHTML(UrlHelper.unencode(uri))` method would naively process URLs, which could result in a badly formed response.

`Addressable::URI.normalized_encode(uri)` appears to deal with these edge-cases in a more robust way.

* DEV: onebox should use UrlHelper

* DEV: fix spec

* DEV: Escape output when rendering local links
2020-12-03 17:16:01 -05:00
e0d9232259 FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
b0e781e2d4 FIX: do not follow redirect on same host with path /login or /session 2019-08-07 16:26:55 +05:30
4ea21fa2d0 DEV: use #frozen_string_literal: true on all spec
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.

Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
2019-04-30 10:27:42 +10:00
1ab91f0474 FIX: preserve github fragment URL 2018-12-19 12:34:47 +05:30
b6963b8ffb FIX: Ignore OneBox blacklisted domains. 2018-08-27 20:40:55 +02:00
7058205f70 FIX: Broken specs 2018-07-24 12:00:34 -04:00
236243f38a SECURITY: Consider 0.0.0.0 a private IP 2018-07-24 11:16:27 -04:00
142571bba0 Remove use of rescue nil.
* `rescue nil` is a really bad pattern to use in our code base.
  We should rescue errors that we expect the code to throw and
  not rescue everything because we're unsure of what errors the
  code would throw. This would reduce the amount of pain we face
  when debugging why something isn't working as expexted. I've
  been bitten countless of times by errors being swallowed as a
  result during debugging sessions.
2018-04-02 13:52:51 +08:00
0559a4736a FIX: don't double request when downloading a file 2018-02-24 12:35:57 +01:00
b6277e208b FIX: Cookies header didn't have the right format 2018-02-19 12:46:57 +01:00
Sam
fa5880e04f PERF: ability to crawl for titles without extra HEAD req
Also, introduces a much more aggressive timeout for title crawling
and introduces gzip to body that is crawled
2018-01-29 15:40:12 +11:00
Sam
1dd2b51059 remove redundent stubs 2017-10-18 12:10:30 +11:00
8185b8cb06 FEATURE: cache https redirects per hostname
If a hostname does an https redirect we cache that so next
lookup does not incur it.

Also, only rate limit per ip once per final destination

Raise final destination protection to 1000 ip lookups an hour
2017-10-17 16:22:54 +11:00
Sam
70bb2aa426 FEATURE: allow specifying s3 config via globals
This refactors handling of s3 so it can be specified via GlobalSetting

This means that in a multisite environment you can configure s3 uploads
without actual sites knowing credentials in s3

It is a critical setting for situations where assets are mirrored to s3.
2017-10-06 16:20:01 +11:00
5324c01209 FIX: Don't raise an error if reading from URL timeout. 2017-09-27 14:53:22 +08:00
367fb1c524 FIX: Onebox fails on encoded URL.
https://meta.discourse.org/t/onebox-breaks-if-theres-chinese-text-in-url/67364
2017-09-26 18:34:54 +08:00
6cd8203686 FIX: allows onebox to force GET hosts returning wrong headers on HEAD 2017-08-08 11:44:27 +02:00
b059a0f789 extract url escaping to a dedicated class method and improved tests 2017-07-29 22:16:51 +05:30
1fe553873c FIX: preserve fragment identifier when escaping url 2017-07-29 17:22:45 +05:30
b534778f46 FIX: Escape URL before attempting to resolve it. 2017-07-18 10:04:24 +09:00
db485ae0da FIX: Support for skipping redirects on certain domains (like steam) 2017-06-26 15:38:43 -04:00
009f0921dc FEATURE: Whitelist hosts for internal crawling 2017-06-13 12:59:54 -04:00
a3729b51eb FIX: Always allow the host the forum is hosted on 2017-06-12 13:22:51 -04:00
53b95f009f FIX: If HEAD is not supported, try GET. Also set cookies 2017-06-06 13:53:49 -04:00
56f98de7b2 Use webmock to stub external web requests. 2017-05-26 15:19:09 +08:00
f8f1548fd4 Revert "FIX: Use Excon to do its own stubbing"
This reverts commit 80af54460a81bf1593cacb1f0b23e0e0a42c8076.
2017-05-26 13:04:25 +08:00
3b0cbf7013 FIX: Always allow downloads from CDN 2017-05-23 16:32:54 -04:00
b81e7be9a1 FEATURE: Rate limit how often we'll crawl a destination IP 2017-05-23 15:03:04 -04:00
36e477750c FIX: Use same code path for downloading images 2017-05-23 14:51:30 -04:00
e5e7a15a85 SECURITY: Never crawl by IP 2017-05-23 13:07:18 -04:00
93a5fc62bf FEATURE: A site setting to prevent crawling on private IP blocks 2017-05-23 11:56:06 -04:00
80af54460a FIX: Use Excon to do its own stubbing 2017-05-22 18:19:20 -04:00
b51126dd5e FIX: Reset the WebMock after before every test 2017-05-22 17:52:31 -04:00
4c690f7089 Use FinalDestination to ensure public redirects for onebox 2017-05-22 16:42:49 -04:00
b23fc2bf84 Helper to find the final destination for a URL 2017-05-22 15:52:41 -04:00