Author | Commit | Message | Date
Adrian Malacoda | 05c766011f | style and convention fixes to make pylint happy | 2016-12-16 00:29:59 -06:00
Adrian Malacoda | 1db0d315b8 | add lint and pyinstaller targets | 2016-12-15 23:49:42 -06:00
Adrian Malacoda | e574207656 | move venv logic to makefile | 2016-12-15 23:14:42 -06:00
Adrian Malacoda | 9765675925 | use python3 -m venv to create virtualenv | 2016-11-28 03:07:31 -06:00
Adrian Malacoda | 93db0de79f | Merge branch 'master' of gitlab.monarch-pass.net:malacoda/the-great-escape | 2016-11-28 02:59:16 -06:00
Adrian Malacoda | 8c1f8a7887 | add a script that can set up virtualenv if it doesn't exist, and run tge in virtualenv | 2016-11-28 02:58:50 -06:00
Adrian Malacoda | e93f2ba574 | should probably sanitize category title as well | 2016-11-27 21:15:39 -06:00
Adrian Malacoda | 347a50bf6e | need to quote key | 2016-11-27 21:14:49 -06:00
Adrian Malacoda | 77775ae0be | need to pass url to scrape_board_from_document | 2016-11-27 17:52:47 -06:00
Adrian Malacoda | acdf659e4a | dry up pagination logic using a generator | 2016-11-27 17:50:25 -06:00
Adrian Malacoda | 4cd9b22eb9 | Use a loop to iterate thread/board pages, not recursion. For large threads this can cause a stack overflow. Also, since we're no longer doing the http request in the same function that does the scraping, we need to limit the @retry to the function that actually does the http call as that's what we want to be retrying. | 2016-11-27 17:42:46 -06:00
Adrian Malacoda | b67ab06b55 | Add exponential backoff for retrying | 2016-11-27 13:03:13 -06:00
Adrian Malacoda | f47895fd46 | add retrying dependency | 2016-11-27 13:01:20 -06:00
Adrian Malacoda | 83802088bf | url -> in | 2016-11-27 02:07:21 -06:00
Adrian Malacoda | 6e184478e0 | commas and quotes too | 2016-11-27 02:06:38 -06:00
Adrian Malacoda | a517f5c28c | sanitize some more characters. Not all of these might be unsafe but some are at least weird looking in filenames | 2016-11-27 02:06:07 -06:00
Adrian Malacoda | bb19533df6 | __sanitize_title -> sanitize_title | 2016-11-27 02:04:54 -06:00
Adrian Malacoda | c52f472091 | redo main() so it can work with either the local filesystem or urls. rename --url to --in for consistency. | 2016-11-27 02:04:44 -06:00
Adrian Malacoda | 808677b327 | split the sanitize_title into a util module | 2016-11-27 02:04:20 -06:00
Adrian Malacoda | ea6b0ddb1f | need to write in binary mode | 2016-11-27 01:58:47 -06:00
Adrian Malacoda | aae663d518 | add pickle scraper which can load serialized pickle files | 2016-11-27 01:58:39 -06:00
Adrian Malacoda | dda2f183e3 | if can_scrape_url isn't available don't call it | 2016-11-27 01:58:25 -06:00
Adrian Malacoda | 30aeead404 | add import | 2016-11-27 01:52:45 -06:00
Adrian Malacoda | 43f1f6a680 | add pickle outputter | 2016-11-27 01:42:59 -06:00
Adrian Malacoda | ac76474030 | need to clean up title by removing certain characters | 2016-11-27 01:41:17 -06:00
Adrian Malacoda | 2e93aaa9bc | want to output board description | 2016-11-27 01:21:07 -06:00
Adrian Malacoda | d54f3ec21c | there's multiple h1's on the page and the one we want is like .eq(2) or something. But once you start addressing nodes by index like that you get real brittle and can break easily. I don't think we have a problem with just selecting all h1's here. | 2016-11-27 01:19:59 -06:00
Adrian Malacoda | 39b8bfff30 | grab board description from forum index (we can't get it from the board index) | 2016-11-27 01:13:45 -06:00
Adrian Malacoda | 3f4eecc238 | use dateutil to parse rfc3339 datetime strings in <time> elements, if they are present. | 2016-11-27 01:10:04 -06:00
Adrian Malacoda | c800312423 | add python-dateutil dep | 2016-11-27 01:01:52 -06:00
Adrian Malacoda | 5bcb6e8884 | add extra post & user info | 2016-11-27 00:48:55 -06:00
Adrian Malacoda | 71a4f8c5a4 | add extra user metadata such as title and avatar | 2016-11-27 00:43:11 -06:00
Adrian Malacoda | de89ddb350 | Add timestamp to post model | 2016-11-27 00:42:27 -06:00
Adrian Malacoda | 61e25fe9d9 | example of large thread | 2016-11-27 00:34:59 -06:00
Adrian Malacoda | c83d4a9916 | for now, limit to forumer forums (fr.yuku.com) as I'm not sure if this scraper will support non-forumer ones | 2016-11-27 00:18:39 -06:00
Adrian Malacoda | 55176e4596 | more examples in readme | 2016-11-27 00:17:04 -06:00
Adrian Malacoda | 741573d30a | only want first h1/h2 etc | 2016-11-27 00:16:21 -06:00
Adrian Malacoda | ea46ae8853 | .text() not text | 2016-11-27 00:14:16 -06:00
Adrian Malacoda | 9c401cbfb1 | need to use .items() grumble grumble | 2016-11-27 00:11:42 -06:00
Adrian Malacoda | b304297019 | fix signature parsing, use html instead of text. Unfortunately there's a lot of garbage here we'll have to clean up | 2016-11-27 00:03:30 -06:00
Adrian Malacoda | 6fb7980218 | make threads subdir under board so we can put an index.json there with board metadata | 2016-11-26 23:54:05 -06:00
Adrian Malacoda | eabf099f47 | fix for yuku's broken postbit markup | 2016-11-26 23:42:30 -06:00
Adrian Malacoda | f4540d4030 | expand readme | 2016-11-26 23:16:06 -06:00
Adrian Malacoda | c04c030540 | add user object | 2016-11-26 23:14:09 -06:00
Adrian Malacoda | 933e178ce5 | initial commit for the-great-escape yuku scraper | 2016-11-26 23:09:12 -06:00
Adrian Malacoda | e5fb7e5c9a | initial commit | 2016-11-26 21:02:28 -06:00
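
Commits 4cd9b22eb9, acdf659e4a and b67ab06b55 describe three related ideas: walking thread/board pages with a loop instead of recursion, wrapping that loop in a generator, and retrying only the HTTP call with exponential backoff. The sketch below illustrates that pattern and is not the project's actual code. Every name in it (`retry_with_backoff`, `iter_pages`, `fake_fetch`, `FAKE_SITE`) is hypothetical, the pages are in-memory dicts rather than real HTTP responses, and the backoff is hand-written with the standard library, whereas the log shows the project adding the `retrying` dependency for this.

```python
import time
from functools import wraps


def retry_with_backoff(attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped call on IOError, doubling the wait each time."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except IOError:
                    if attempt == attempts - 1:
                        raise  # out of attempts, surface the error
                    # Waits base_delay, then 2x, then 4x, and so on.
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


def iter_pages(fetch, first_url):
    """Yield pages one by one, following 'next' links with a loop.

    A loop keeps the call stack flat no matter how many pages a
    thread has, which recursion would not.
    """
    url = first_url
    while url is not None:
        page = fetch(url)  # only the fetch is retried, not the scraping
        yield page
        url = page.get("next")


# Hypothetical stand-in for a forum thread: three linked pages.
FAKE_SITE = {
    "/t/1": {"posts": ["a", "b"], "next": "/t/2"},
    "/t/2": {"posts": ["c"], "next": "/t/3"},
    "/t/3": {"posts": ["d"], "next": None},
}
failures = {"/t/2": 1}  # page 2 fails once before succeeding
delays = []             # records waits instead of actually sleeping


@retry_with_backoff(attempts=3, base_delay=0.5, sleep=delays.append)
def fake_fetch(url):
    if failures.get(url, 0) > 0:
        failures[url] -= 1
        raise IOError("transient failure for " + url)
    return FAKE_SITE[url]


posts = [post for page in iter_pages(fake_fetch, "/t/1")
         for post in page["posts"]]
```

Keeping the retry decorator on the fetch function alone means a failure on one page re-requests only that page, and the generator lets thread and board scraping share the same paging loop.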