Suppose you want to monitor the Internet for certain keywords such as your username, your email address, or some text pattern relevant to you or your company. A good starting point is to monitor Pastebin for those keywords, since that is where much of the interesting stuff gets dumped.
What I wanted was simple:
- detect new pastes that match a given pattern
- save matching pastes
- do some processing on those pastes
- notify me with the results
You can register alarms on Pastebin for up to 3 keywords and get notified upon matching, but you cannot use regular expressions or exclude keywords. Also, mail notifications are not the easiest to handle automatically.
The classical choice was to use Pastemon, maybe the first software of its kind. I had two problems with it: it would not save pastes to file and it was no longer maintained.
Searching for alternatives, I found Pystemon, a newer and improved version of Pastemon.
So I ended up building the following setup:
1. Set up Pystemon to download and store any pastes that match my rules.
- get (my version of) Pystemon
    git clone https://github.com/tmendo/pystemon.git
- edit pystemon.yaml
  - set the archive dir
  - enable and configure email alerts
  - enable the use proxy option
  - define some search patterns, for instance
      - search: '[^a-zA-Z0-9]example\.(com|org)'
        description: 'example domains'
      - search: 'tmendo'
        description: 'tmendo'
        exclude: 'tmendoza'
- run it with
    python pystemon.py -v -c pystemon.yaml
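To get a feel for how a search/exclude rule like the ones above behaves before pointing it at live pastes, here is a minimal Python sketch. It is not Pystemon's code: the `RULES` list and the `matching_rules` helper are hypothetical, and it assumes that an `exclude` pattern found anywhere in the paste suppresses the rule.

```python
import re

# Hypothetical rules mirroring the YAML structure (not read from pystemon.yaml).
# The dot is escaped so that it only matches a literal ".".
RULES = [
    {"search": r"[^a-zA-Z0-9]example\.(com|org)", "description": "example domains"},
    {"search": r"tmendo", "description": "tmendo", "exclude": r"tmendoza"},
]


def matching_rules(text, rules=RULES):
    """Return the descriptions of the rules that fire on the given text."""
    hits = []
    for rule in rules:
        if not re.search(rule["search"], text):
            continue
        exclude = rule.get("exclude")
        # Assumption: an exclude match anywhere in the text cancels the rule.
        if exclude and re.search(exclude, text):
            continue
        hits.append(rule["description"])
    return hits
```

Note that the first pattern requires a non-alphanumeric character before `example`, so a paste that starts with `example.com` would not match, while `admin@example.com` would.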
A few mistakes later, Pastebin blocked my IP. They eventually unblocked it and told me the threshold to stay under, but in the meantime I had decided to proxy the requests through TOR. Since TOR only exposes a SOCKS proxy and Pystemon only supports HTTP proxies, I used DeleGate to bridge the two.
Pystemon was originally designed to cycle through multiple proxies, removing those that fail. I needed it to stick with the TOR proxy even when it occasionally failed, so I forked the project, added support for a single proxy that is never removed, fixed some TODOs, included a few fixes from another committer, and submitted a pull request.
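The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from Pystemon or from my fork; the `ProxyPool` class and its `remove_on_failure` flag are hypothetical names.

```python
import random


class ProxyPool:
    """Hypothetical proxy pool; illustrates the policy, not Pystemon's classes."""

    def __init__(self, proxies, remove_on_failure=True):
        self.proxies = list(proxies)
        self.remove_on_failure = remove_on_failure

    def pick(self):
        # Return a random proxy, or None when the pool has been emptied.
        return random.choice(self.proxies) if self.proxies else None

    def report_failure(self, proxy):
        # Rotating mode drops a failing proxy; sticky mode keeps it for the next try.
        if self.remove_on_failure and proxy in self.proxies:
            self.proxies.remove(proxy)
```

With a single TOR-backed proxy and `remove_on_failure=False`, a transient failure leaves the pool untouched, so the next request goes through the same proxy again.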
2. Set up DeleGate to convert a SOCKS proxy to an HTTP proxy
Because TOR exposes a SOCKS proxy and Pystemon only supports HTTP proxies, I am running DeleGate to pass requests between them.
- download the source and unpack it
    tar zxvf delegate9.9.13.tar.gz
- create dg.conf
    -P127.0.0.1:8080
    -Tx
    -fv
    SOCKS=127.0.0.1:9050
    DGROOT="/path/to/delegate/"
    SERVER=http
    REMITTABLE="http,https/443"
    CACHE=no
    TIMEOUT="shutout:30m"
    MAXIMA="randstack:32"
    MAXIMA="randenv:1024"
    MAXIMA="randfd:32"
    [email protected]
- execute with
3. Set up TOR following these instructions.
This will install and start the daemon with the default configuration, which generates new circuits every 10 minutes (you get a new IP every 10 minutes, at least). Pastebin does not discriminate against TOR exit nodes; I am not sure about the other paste sites.
Test the DeleGate and TOR combination:
    telnet localhost 8080
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    GET / HTTP/1.1
    host: mendo.pt

    HTTP/1.1 301 Moved Permanently
    DeleGate-Ver: 9.9.13 (delay=9)
    Server: nginx
    Date: Fri, 27 Mar 2015 14:38:04 GMT
    Content-Type: text/html
    Location: https://mendo.pt/
    Strict-Transport-Security: max-age=31536000; includeSubDomains
    Via: 1.1 - (DeleGate/9.9.13)
    Connection: close
    Content-Length: 178
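The same check can be scripted. The following is a minimal sketch using only the Python standard library, assuming DeleGate listens on 127.0.0.1:8080 as configured above; `make_proxy_opener` and `fetch_headers` are hypothetical helper names.

```python
import urllib.request


def make_proxy_opener(proxy_url="http://127.0.0.1:8080"):
    """Build an opener that sends HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_headers(url, opener, timeout=60):
    """Fetch url through the opener and return (status code, response headers)."""
    with opener.open(url, timeout=timeout) as response:
        return response.status, dict(response.headers)
```

Calling `fetch_headers("http://mendo.pt/", make_proxy_opener())` should succeed when both DeleGate and TOR are up. Unlike the telnet session, urllib follows redirects, so you would see the final response rather than the 301.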
4. Set up a script that parses the stored pastes and does something useful.
You can use inotifywait to detect new pastes downloaded by Pystemon and do whatever you like with them (run them in a sandbox, do some further string matching, etc.)
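    #!/bin/bash
    # Matches ISDIR as a whole entry in inotifywait's comma-separated event list.
    # (Unquoted, \bISDIR\b loses its backslashes and never matches.)
    is_dir_pattern='(^|,)ISDIR(,|$)'
    inotifywait -r -m -e close_write /path/to/pystemon/output |
    while read -r path action file; do
        if [[ ! "$action" =~ $is_dir_pattern ]]; then
            echo "The file '$file' appeared in directory '$path' via '$action'"
            # process $path/$file
            # notify me of some finding
        fi
    done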
Listening to create and moved_to events might result in you being warned of a new file before its contents are written, so you would end up with an empty file to process. Use close_write instead.
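As an example of the "further string matching" step, here is a minimal Python sketch that the loop above could call for each new paste. The `PATTERNS` table and the `scan_paste` helper are hypothetical and only illustrate the idea; pick patterns that make sense for what you are hunting.

```python
import re
from pathlib import Path

# Hypothetical follow-up patterns to look for inside a stored paste.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "password field": re.compile(r"(?i)\bpass(word)?\s*[:=]"),
}


def scan_paste(path, patterns=PATTERNS):
    """Return (label, line number, line) for every line matching a pattern."""
    findings = []
    # Pastes can contain arbitrary bytes, so undecodable ones are replaced.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in patterns.items():
            if pattern.search(line):
                findings.append((label, lineno, line.strip()))
    return findings
```

The returned list can then be turned into whatever notification you prefer.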
And that's it. Put everything under DJB Daemontools and you can monitor paste sites for whatever you like, for whatever reason.