25 Commits

Author SHA1 Message Date
Adrian Malacoda
bb19533df6 __sanitize_title -> sanitize_title 2016-11-27 02:04:54 -06:00
Adrian Malacoda
c52f472091 redo main() so it can work with either the local filesystem or urls. rename --url to --in for consistency. 2016-11-27 02:04:44 -06:00
Adrian Malacoda
808677b327 split the sanitize_title into a util module 2016-11-27 02:04:20 -06:00
Adrian Malacoda
ea6b0ddb1f need to write in binary mode 2016-11-27 01:58:47 -06:00
Adrian Malacoda
aae663d518 add pickle scraper which can load serialized pickle files 2016-11-27 01:58:39 -06:00
Adrian Malacoda
dda2f183e3 if can_scrape_url isn't available don't call it 2016-11-27 01:58:25 -06:00
Adrian Malacoda
30aeead404 add import 2016-11-27 01:52:45 -06:00
Adrian Malacoda
43f1f6a680 add pickle outputter 2016-11-27 01:42:59 -06:00
Adrian Malacoda
ac76474030 need to clean up title by removing certain characters 2016-11-27 01:41:17 -06:00
Adrian Malacoda
2e93aaa9bc want to output board description 2016-11-27 01:21:07 -06:00
Adrian Malacoda
d54f3ec21c there's multiple h1's on the page and the one we want is like .eq(2) or something. But once you start addressing nodes by index like that you get real brittle and can break easily. I don't think we have a problem with just selecting all h1's here. 2016-11-27 01:19:59 -06:00
Adrian Malacoda
39b8bfff30 grab board description from forum index (we can't get it from the board index) 2016-11-27 01:13:45 -06:00
Adrian Malacoda
3f4eecc238 use dateutil to parse rfc3339 datetime strings in <time> elements, if they are present. 2016-11-27 01:10:04 -06:00
Adrian Malacoda
5bcb6e8884 add extra post & user info 2016-11-27 00:48:55 -06:00
Adrian Malacoda
71a4f8c5a4 add extra user metadata such as title and avatar 2016-11-27 00:43:11 -06:00
Adrian Malacoda
de89ddb350 Add timestamp to post model 2016-11-27 00:42:27 -06:00
Adrian Malacoda
c83d4a9916 for now, limit to forumer forums (fr.yuku.com) as I'm not sure if this scraper will support non-forumer ones 2016-11-27 00:18:39 -06:00
Adrian Malacoda
741573d30a only want first h1/h2 etc 2016-11-27 00:16:21 -06:00
Adrian Malacoda
ea46ae8853 .text() not text 2016-11-27 00:14:16 -06:00
Adrian Malacoda
9c401cbfb1 need to use .items() grumble grumble 2016-11-27 00:11:42 -06:00
Adrian Malacoda
b304297019 fix signature parsing, use html instead of text. Unfortunately there's a lot of garbage here we'll have to clean up 2016-11-27 00:03:30 -06:00
Adrian Malacoda
6fb7980218 make threads subdir under board so we can put an index.json there with board metadata 2016-11-26 23:54:05 -06:00
Adrian Malacoda
eabf099f47 fix for yuku's broken postbit markup 2016-11-26 23:42:30 -06:00
Adrian Malacoda
c04c030540 add user object 2016-11-26 23:14:09 -06:00
Adrian Malacoda
933e178ce5 initial commit for the-great-escape yuku scraper 2016-11-26 23:09:12 -06:00
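
Commits ac76474030, 808677b327 and bb19533df6 above refer to a `sanitize_title` helper that cleans up a title "by removing certain characters". The log does not say which characters are removed or why, so the following is only a minimal sketch. It assumes the goal is a name that is safe to use as a directory or file name, and the `UNSAFE_CHARS` set is hypothetical, not taken from the repository.

```python
import re

# Assumption: characters commonly disallowed in file names.
# The set actually removed by the project's sanitize_title is not given in the log.
UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_title(title):
    """Return the title with unsafe characters removed and whitespace collapsed."""
    cleaned = UNSAFE_CHARS.sub("", title)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(sanitize_title("General: Q/A?"))
```

Removing characters rather than replacing them keeps the result short, but two different titles can then collapse to the same name; a real implementation might need to handle such collisions.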