Commit e01d578

Percent-decode filenames and use index.html if there's no filename in the URL
Merge https://salsa.debian.org/debian/wcurl/-/merge_requests/4 closes #10 closes #4
2 parents 5ba0a5f + 3e647e2 commit e01d578

4 files changed: +162 -13 lines changed

README.md

Lines changed: 9 additions & 4 deletions
@@ -16,8 +16,8 @@ SPDX-License-Identifier: curl
 
 # Synopsis
 
-    wcurl [--curl-options <CURL_OPTIONS>]... [--dry-run] [--] <URL>...
-    wcurl [--curl-options=<CURL_OPTIONS>]... [--dry-run] [--] <URL>...
+    wcurl [--curl-options <CURL_OPTIONS>]... [--dry-run] [--no-decode-filename] [--] <URL>...
+    wcurl [--curl-options=<CURL_OPTIONS>]... [--dry-run] [--no-decode-filename] [--] <URL>...
     wcurl -V|--version
     wcurl -h|--help
 
@@ -35,14 +35,16 @@ should be using curl directly if your use case is not covered.
 
 
 * By default, **wcurl** will:
-  * Encode whitespaces in URLs;
+  * Percent-encode whitespaces in URLs;
   * Download multiple URLs in parallel if the installed curl's version is >= 7.66.0;
   * Follow redirects;
   * Automatically choose a filename as output;
   * Avoid overwriting files if the installed curl's version is >= 7.83.0 (`--no-clobber`);
   * Perform retries;
   * Set the downloaded file timestamp to the value provided by the server, if available;
   * Disable **curl**'s URL globbing parser so `{}` and `[]` characters in URLs are not treated specially.
+  * Percent-decode the resulting filename.
+  * Use "index.html" as default filename if there's none in the URL.
 
 # Options
 
@@ -51,6 +53,9 @@ should be using curl directly if your use case is not covered.
 
   Specify extra options to be passed when invoking curl. May be specified more than once.
 
+* `--no-decode-filename`
+  Don't percent-decode the output filename, even if the percent-encoding in the URL was done by wcurl (e.g. because the URL contained whitespaces).
+
 * `--dry-run`
 
   Don't actually execute curl, just print what would be invoked.
@@ -66,7 +71,7 @@ should be using curl directly if your use case is not covered.
 # Url
 
 Anything which is not a parameter will be considered an URL.
-**wcurl** will encode whitespaces and pass that to curl, which will perform the
+**wcurl** will percent-encode whitespaces and pass that to curl, which will perform the
 parsing of the URL.
 
 # Examples
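
The two new defaults described in the README hunks (take the filename from the URL, fall back to "index.html") can be illustrated with a small standalone sketch. This is not wcurl's code: `pick_filename` is a hypothetical helper written for illustration only, it uses shell parameter expansion where the commit uses `sed`, and it skips percent-decoding entirely.

```shell
#!/bin/sh
# Hypothetical helper (not part of wcurl): print the output filename that
# would be chosen for a URL, falling back to index.html.
# Percent-decoding is deliberately left out of this sketch.
pick_filename()
{
    pf_rest="${1%%[?]*}"        # drop the query string, if any
    pf_rest="${pf_rest#*://}"   # drop a leading "scheme://", if any
    case "${pf_rest}" in
        */*) pf_name="${pf_rest##*/}";;  # keep what follows the last slash
        *)   pf_name="";;                # hostname only, no path
    esac
    printf '%s\n' "${pf_name:-index.html}"
}

pick_filename 'example.com'
pick_filename 'example.com/dir/'
pick_filename 'https://example.com/file.txt?query=string'
```

The three calls cover the cases the README mentions: no path, a path ending in a slash, and a path with a filename followed by a query string.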

tests/tests.sh

Lines changed: 77 additions & 0 deletions
@@ -112,6 +112,83 @@ testUrlStartingWithDash()
     assertEquals "${ret}" "Unknown option: '-example.com'."
 }
 
+testUrlDefaultName()
+{
+    url='example%20with%20spaces.com'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' chooses the correct default filename when there's no path in the URL" "${ret}" 'index.html'
+}
+
+testUrlDefaultNameTrailingSlash()
+{
+    url='example%20with%20spaces.com/'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' chooses the correct default filename when there's no path in the URL and the URL ends with a slash" "${ret}" 'index.html'
+}
+
+testUrlDecodingWhitespaces()
+{
+    url='example.com/filename%20with%20spaces'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded whitespaces in URLs" "${ret}" 'filename with spaces'
+}
+
+testUrlDecodingWhitespacesTwoFiles()
+{
+    url='example.com/filename%20with%20spaces'
+    url_2='example.com/filename2%20with%20spaces'
+    ret=$(${WCURL_CMD} ${url} ${url_2} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded whitespaces in URLs" "${ret}" 'filename with spaces'
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded whitespaces in URLs" "${ret}" 'filename2 with spaces'
+}
+
+testUrlDecodingDisabled()
+{
+    url='example.com/filename%20with%20spaces'
+    ret=$(${WCURL_CMD} --no-decode-filename ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' leaves percent-encoded whitespaces undecoded when '--no-decode-filename' is given" "${ret}" 'filename%20with%20spaces'
+}
+
+testUrlDecodingWhitespacesQueryString()
+{
+    url='example.com/filename%20with%20spaces?query=string'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded whitespaces in URLs with query strings" "${ret}" 'filename with spaces'
+}
+
+testUrlDecodingWhitespacesTrailingSlash()
+{
+    url='example.com/filename%20with%20spaces/'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully uses the default filename when the URL ends with a slash" "${ret}" 'index.html'
+}
+
+# Test decoding several languages that don't use the Latin alphabet. Each
+# language could get its own test, but for now it doesn't make a
+# difference.
+testUrlDecodingNonLatinLanguages()
+{
+    # Arabic
+    url='example.com/%D8%AA%D8%B1%D9%85%D9%8A%D8%B2_%D8%A7%D9%84%D9%86%D8%B3%D8%A8%D8%A9_%D8%A7%D9%84%D9%85%D8%A6%D9%88%D9%8A%D8%A9'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded Arabic in URLs" "${ret}" 'ترميز_النسبة_المئوية'
+
+    # Persian
+    url='example.com/%DA%A9%D8%AF%D8%A8%D9%86%D8%AF%DB%8C_%D8%AF%D8%B1%D8%B5%D8%AF%DB%8C'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded Persian in URLs" "${ret}" 'کدبندی_درصدی'
+
+    # Japanese
+    url='example.com/%E3%83%91%E3%83%BC%E3%82%BB%E3%83%B3%E3%83%88%E3%82%A8%E3%83%B3%E3%82%B3%E3%83%BC%E3%83%87%E3%82%A3%E3%83%B3%E3%82%B0'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded Japanese in URLs" "${ret}" 'パーセントエンコーディング'
+
+    # Korean
+    url='example.com/%ED%8D%BC%EC%84%BC%ED%8A%B8_%EC%9D%B8%EC%BD%94%EB%94%A9'
+    ret=$(${WCURL_CMD} ${url} 2>&1)
+    assertContains "Verify whether 'wcurl' successfully decodes percent-encoded Korean in URLs" "${ret}" '퍼센트_인코딩'
+}
+
 ## Ideas for tests:
 ##
 ## - URL with whitespace
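
The non-Latin cases in this test file work because percent-decoding is byte-oriented: each `%XX` becomes one byte, and consecutive bytes form a multi-byte UTF-8 character once they are written out together. The sketch below shows that idea in isolation. `hex_to_bytes` is a hypothetical helper, not part of wcurl or its test suite, and it uses the POSIX `\0ddd` escape form of `printf`'s `%b` rather than the exact call used in the commit.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes named by two-digit hex values.
hex_to_bytes()
{
    for hb_hex in "$@"; do
        # Convert the hex value to octal, then let %b expand the "\0ddd"
        # escape into a single byte.
        printf "%b" "\\0$(printf %o "0x${hb_hex}")"
    done
}

# %D8%AA is the start of the Arabic test URL: together the two bytes are
# the UTF-8 encoding of the first letter of the expected filename.
hex_to_bytes D8 AA
printf '\n'
```

Whether the result is displayed as one character depends on the terminal using UTF-8; the bytes themselves are the same either way.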

wcurl

Lines changed: 68 additions & 5 deletions
@@ -48,8 +48,8 @@ usage()
     cat << _EOF_
 ${PROGRAM_NAME} -- a simple wrapper around curl to easily download files.
 
-Usage: ${PROGRAM_NAME} [--curl-options <CURL_OPTIONS>] [--dry-run] [--] <URL>...
-       ${PROGRAM_NAME} [--curl-options=<CURL_OPTIONS>] [--dry-run] [--] <URL>...
+Usage: ${PROGRAM_NAME} [--curl-options <CURL_OPTIONS>] [--dry-run] [--no-decode-filename] [--] <URL>...
+       ${PROGRAM_NAME} [--curl-options=<CURL_OPTIONS>] [--dry-run] [--no-decode-filename] [--] <URL>...
        ${PROGRAM_NAME} -h|--help
        ${PROGRAM_NAME} -V|--version
 
@@ -59,14 +59,19 @@ Options:
                                  passed when invoking curl. May be
                                  specified more than once.
 
+  --no-decode-filename: Don't percent-decode the output filename,
+                        even if the percent-encoding in the URL was
+                        done by wcurl (e.g. because the URL contained
+                        whitespaces).
+
   --dry-run: Don't actually execute curl, just print what would be
              invoked.
 
   -V,--version: Print version information.
 
   -h,--help: Print this usage message.
 
-  <URL>: The URL to be downloaded. May be specified more than once.
+  <URL>: The URL to be downloaded. May be specified more than once.
 _EOF_
 }
 
@@ -93,7 +98,6 @@ readonly PER_URL_PARAMETERS="\
     --globoff \
     --location \
     --proto-default https \
-    --remote-name-all \
     --remote-time \
     --retry 10 \
     --retry-max-time 10 "
@@ -111,6 +115,56 @@ sanitize()
     readonly CURL_OPTIONS URLS DRY_RUN
 }
 
+# Indicate via exit code whether the string given in the first parameter
+# consists solely of characters from the string given in the second parameter.
+# In other words, it returns 0 if the first parameter only contains characters
+# from the second parameter, i.e.: are the characters of $1 a subset of those of $2?
+is_subset_of()
+{
+    case "${1}" in
+        *[!${2}]*|'') return 1;;
+    esac
+}
+
+# Print the given string percent-decoded.
+percent_decode()
+{
+    # Encodings of control characters (00-1F) are passed through without decoding.
+    # Iterate on the input character-by-character, decoding it.
+    printf "%s\n" "${1}" | fold -w1 | while IFS= read -r decode_out; do
+        # If character is a "%", read the next character as decode_hex1.
+        if [ "${decode_out}" = % ] && IFS= read -r decode_hex1; then
+            decode_out="${decode_out}${decode_hex1}"
+            # If there's one more character, read it as decode_hex2.
+            if IFS= read -r decode_hex2; then
+                decode_out="${decode_out}${decode_hex2}"
+                # Skip decoding if this is a control character (00-1F).
+                # Skip decoding if DECODE_FILENAME is not "true".
+                if is_subset_of "${decode_hex1}" "23456789abcdefABCDEF" && \
+                    is_subset_of "${decode_hex2}" "0123456789abcdefABCDEF" && \
+                    [ "${DECODE_FILENAME}" = "true" ]; then
+                    # Use printf to convert the hex value to octal, then expand that octal escape into the final byte.
+                    decode_out="$(printf "%b" "\\$(printf %o "0x${decode_hex1}${decode_hex2}")")"
+                fi
+            fi
+        fi
+        printf %s "${decode_out}"
+    done
+}
+
+# Print the percent-decoded filename portion of the given URL.
+get_url_filename()
+{
+    # Remove protocol and query string if present.
+    hostname_and_path="$(printf %s "${1}" | sed -e 's,^[^/]*//,,' -e 's,?.*$,,')"
+    # If what remains contains a slash, there's a path; return it percent-decoded.
+    case "${hostname_and_path}" in
+        # sed to remove everything preceding the last '/', e.g.: "example/something" becomes "something"
+        */*) percent_decode "$(printf %s "${hostname_and_path}" | sed -e 's,^.*/,,')";;
+    esac
+    # No slash means there was just a hostname and no path; return empty string.
+}
+
 # Execute curl with the list of URLs provided by the user.
 exec_curl()
 {
@@ -159,8 +213,10 @@ exec_curl()
 
     NEXT_PARAMETER=""
     for url in ${URLS}; do
+        filename="$(get_url_filename "${url}")"
+        [ -z "${filename}" ] && filename=index.html
         # shellcheck disable=SC2086
-        set -- "$@" ${NEXT_PARAMETER} ${PER_URL_PARAMETERS} ${CURL_HAS_NO_CLOBBER} ${CURL_OPTIONS} "${url}"
+        set -- "$@" ${NEXT_PARAMETER} ${PER_URL_PARAMETERS} ${CURL_HAS_NO_CLOBBER} ${CURL_OPTIONS} --output "${filename}" "${url}"
         NEXT_PARAMETER="--next"
     done
 
@@ -171,6 +227,9 @@ exec_curl()
     fi
 }
 
+# Default to decoding the output filename
+DECODE_FILENAME="true"
+
 # Use "${1-}" in order to avoid errors because of 'set -u'.
 while [ -n "${1-}" ]; do
     case "${1}" in
@@ -188,6 +247,10 @@ while [ -n "${1-}" ]; do
             DRY_RUN="true"
             ;;
 
+        --no-decode-filename)
+            DECODE_FILENAME="false"
+            ;;
+
         -h|--help)
             usage
             exit 0
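
To try the decoding logic outside of wcurl, the two helper functions from the hunk above can be copied into a standalone script. The sketch below does that, with the comments trimmed and with `DECODE_FILENAME` set by hand instead of by wcurl's option parsing. The sample inputs are made up for illustration.

```shell
#!/bin/sh
# Standalone sketch of is_subset_of and percent_decode from this commit.
# In wcurl, DECODE_FILENAME is set by option parsing; here it is set by hand.
DECODE_FILENAME="true"

# Return 0 only if $1 is non-empty and made solely of characters found in $2.
is_subset_of()
{
    case "${1}" in
        *[!${2}]*|'') return 1;;
    esac
}

# Print $1 with %XX sequences decoded, except control characters (00-1F).
percent_decode()
{
    # fold -w1 emits one character per line so "read" sees them one by one.
    printf "%s\n" "${1}" | fold -w1 | while IFS= read -r decode_out; do
        if [ "${decode_out}" = % ] && IFS= read -r decode_hex1; then
            decode_out="${decode_out}${decode_hex1}"
            if IFS= read -r decode_hex2; then
                decode_out="${decode_out}${decode_hex2}"
                if is_subset_of "${decode_hex1}" "23456789abcdefABCDEF" && \
                    is_subset_of "${decode_hex2}" "0123456789abcdefABCDEF" && \
                    [ "${DECODE_FILENAME}" = "true" ]; then
                    # Hex to octal, then octal escape to the final byte.
                    decode_out="$(printf "%b" "\\$(printf %o "0x${decode_hex1}${decode_hex2}")")"
                fi
            fi
        fi
        printf %s "${decode_out}"
    done
}

# Sample inputs (illustrative only).
percent_decode 'filename%20with%20spaces'; printf '\n'
percent_decode 'line%0Abreak'; printf '\n'
```

The first call prints `filename with spaces`. The second prints `line%0Abreak` unchanged, because the first hex digit `0` is not in the allowed set, which is how encoded control characters are kept out of the output filename.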

wcurl.1

Lines changed: 8 additions & 4 deletions
@@ -27,8 +27,8 @@
 - a simple wrapper around curl to easily download files.
 .SH SYNOPSIS
 .nf
-\fBwcurl [\-\-curl\-options \fI<CURL_OPTIONS>\fP]... [\-\-dry\-run] [\-\-] \fI<URL>\fP...\fR
-\fBwcurl [\-\-curl\-options=\fI<CURL_OPTIONS>\fP]... [\-\-dry\-run] [\-\-] \fI<URL>\fP...\fR
+\fBwcurl [\-\-curl\-options \fI<CURL_OPTIONS>\fP]... [\-\-dry\-run] [\-\-no\-decode\-filename] [\-\-] \fI<URL>\fP...\fR
+\fBwcurl [\-\-curl\-options=\fI<CURL_OPTIONS>\fP]... [\-\-dry\-run] [\-\-no\-decode\-filename] [\-\-] \fI<URL>\fP...\fR
 \fBwcurl \-V|\-\-version\fR
 \fBwcurl \-h|\-\-help\fR
 .fi
@@ -46,7 +46,7 @@ should be using curl directly if your use case is not covered.
 .TP
 By default, \fBwcurl\fR will:
 .br
-\[bu] Encode whitespaces in URLs;
+\[bu] Percent-encode whitespaces in URLs;
 .br
 \[bu] Download multiple URLs in parallel if the installed curl's version is >= 7.66.0;
 .br
@@ -63,6 +63,10 @@ By default, \fBwcurl\fR will:
 \[bu] Default to the protocol used as https if the URL doesn't contain any;
 .br
 \[bu] Disable \fBcurl\fR's URL globbing parser so \fB{}\fR and \fB[]\fR characters in URLs are not treated specially.
+.br
+\[bu] Percent-decode the resulting filename.
+.br
+\[bu] Use "index.html" as default filename if there's none in the URL.
 .SH OPTIONS
 .TP
 \fB\-\-curl\-options, \-\-curl\-options=\fI<CURL_OPTIONS>\fR...\fR
@@ -81,7 +85,7 @@ Any option supported by curl can be set here.
 This is not used by \fBwcurl\fR; it's instead forwarded to the curl invocation.
 .SH URL
 Anything which is not a parameter will be considered an URL.
-\fBwcurl\fR will encode whitespaces and pass that to curl, which will perform the
+\fBwcurl\fR will percent-encode whitespaces and pass that to curl, which will perform the
 parsing of the URL.
 .SH EXAMPLES
 Download a single file:
