Developer blog


Version 1.06 released

Henri Ruutinen 18.07.2017 01:20

Changelog:
- Users can now control the web crawler with a new setting: scan depth. When the indexer discovers links on pages defined as seed URLs, it stores and eventually indexes them as well; this setting controls how deep the link discovery is allowed to go.
- Search results from web indexes now also contain the meta description field.
- The content field of web index results is no longer focused automatically, but users can do this manually in the following manner:
<?php

# if you want to provide a focused version of your ( long )
# indexed data field with highlighted keywords

$query = "mykeyword";     // your search query
$stem_language = "fi";    // language for stemming ( en | fi )
$chars_per_line = 90;     // how many chars before forced linebreak ( <br> )
$max_len = 150;           // how many chars the focused result may contain in total

$row["content"] = $pickmybrain->SearchFocuser($row["content"], $query, $chars_per_line, $max_len);
?>

Version 1.05 released

Henri Ruutinen 14.07.2017 20:20

Changelog:
- Feature improvement: new sorting/grouping attribute @id ( = document id ). Results can now be ordered, or organized within groups, by the internal @id attribute. This provides a faster alternative to external user-defined attributes when the indexed data is naturally ordered in a certain manner ( for example, older entries have smaller @id values ). A usage sketch follows this list.
- The internal search feature has also been updated to support the new sorting/grouping attribute.
- Bug fix: when sorting by an external attribute, disabling certain fields did not have any effect.
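
For illustration, here is a minimal sketch of sorting by the new attribute. The SetSortMode() method and the PMB_SORTBY_ATTR_DESC constant are assumptions modeled on Sphinx-style APIs, not confirmed PMBApi names, so check the API documentation for the actual ones:

<?php
# hypothetical sketch: ordering results by the internal @id attribute;
# SetSortMode() and PMB_SORTBY_ATTR_DESC are assumed names, see the API docs
$pickmybrain->SetSortMode(PMB_SORTBY_ATTR_DESC, "@id"); // newest ( largest @id ) first
$results = $pickmybrain->Search("mykeyword");
?>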

Version 1.04 released

Henri Ruutinen 01.07.2017 20:54

Changelog:
- New feature: keyword suggestions. Pickmybrain can automatically suggest better search terms if the provided keywords seem to be mistyped. This feature is based on the double metaphone phonetic algorithm and works best with the English language. See the API documentation for how to implement and control this feature.

Version 1.03 released

Henri Ruutinen 20.06.2017 18:13

Changelog:
- New feature: synonyms. Users can now define a list of synonyms. Each synonym word is also indexed as the other defined synonym words, which results in more search results.
- New feature: search specific data fields only. By default Pickmybrain searches all data fields in your search index. Fields can now be excluded from the search by setting their field weights to zero.
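
As an example, something along these lines should exclude a field from the search; the SetFieldWeights() method name and the field names are assumptions here, so check the API documentation:

<?php
# hypothetical sketch: a zero weight excludes a field from the search;
# SetFieldWeights() and the field names are assumed, see the API docs
$pickmybrain->SetFieldWeights(array("title" => 1, "content" => 1, "comments" => 0));
$results = $pickmybrain->Search("mykeyword");
?>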

Version 1.02 released

Henri Ruutinen 15.06.2017 18:12

Changelog:
- The main SQL query parser had bugs that, under certain conditions, may have prevented users from running the indexer properly. These should now be fixed.
- Improved main SQL query error checking for database indexes.
- Testing database index settings may have crashed the indexer. The ext_db_connection file was forcibly required a second time in the test script, leading to the same function being defined twice.
- New feature: include original indexed data with search results. PMBApi can now return the original indexed columns with the search results. This feature can be turned on or off in the web control panel or with the IncludeOriginalData(true|false) method ( see the usage sketch after this list ).
- If original data is chosen to be included with the search results, the internal search feature will also show all the indexed columns when a result is clicked.
- The sentiment analysis library had an incorrect environment specific filepath definition. This is now fixed.
- If the indexer crashed for an unknown reason, the indexing state was not updated properly and the indexer seemed to continue indexing forever. This is now fixed.
- The indexer now writes a log after every run. The log file is named "log_[index_id].txt" and it can be used to debug various problems in the future.
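
Here is a minimal usage sketch of the IncludeOriginalData() method mentioned above; the $pickmybrain object is assumed to be already set up, and the exact result array layout is left to the API documentation:

<?php
# minimal usage sketch: attach the original indexed columns to the results
$pickmybrain->IncludeOriginalData(true);
$results = $pickmybrain->Search("mykeyword"); // results now carry the original columns
?>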

Sentiment analysis plugin is now included with version 1.01

Henri Ruutinen 24.05.2017 18:41

Changelog:
- Sentiment analysis plugin with english and finnish language packs is now included with Pickmybrain.

First release version 1.00 now commencing

Henri Ruutinen 04.04.2017 23:15

Changelog:
- This is the first release version as the beta phase is now officially over. This version includes many bug fixes and is generally more carefully tested and polished than the previous ones. However, bugs may still remain :-)
- The most significant change/improvement is a custom data charset designed for the external Unix sort. Previously the data was represented as hexadecimal characters, meaning each byte contained 4 bits of actual information. Now each byte contains 7 bits of information, making the temporary files much smaller and faster to sort. Some other optimizations are in place as well, making the temporary files ~50% smaller in total.
- New feature: disabled documents. As results cannot be removed from an existing search index, documents can be excluded from the result set by defining disabled document ids. The PMBApi now includes two new methods for disabling documents and enabling them again ( a hypothetical usage sketch follows this list ). This feature is index-specific; see the API documentation for more info.
- All the data compressor files had bugs resulting in different amounts of compressed data depending on the script execution method and multiprocessing. These bugs are now fixed.
- Searching is now ( marginally ) faster thanks to careful tuning of the searching algorithm.
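
A hypothetical usage sketch of the new document disabling feature; the method names DisableDocuments() and EnableDocuments() are assumptions, so check the API documentation for the actual ones:

<?php
# hypothetical sketch: excluding documents from the result set by id;
# DisableDocuments()/EnableDocuments() are assumed names, see the API docs
$pickmybrain->DisableDocuments(array(124, 518)); // hide these doc ids from results
$pickmybrain->EnableDocuments(array(124));       // make this one visible again
?>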

Version 0.99 beta released

Henri Ruutinen 21.03.2017 16:34

Changelog:
- The live search feature in the control panel is now improved. Users can now easily play with sort, group and match modes.
- Blend char handling is improved. Substitute words now have the same field position as the original token, which leads to more accurate phrase scoring.
- The search result array now includes all attributes that are used for sorting or grouping results. If the sentiment score is used for sorting or grouping, it is presented as a separate virtual attribute @sentiscore in the result array.
- Sorting or grouping by attributes did not work for web indexes; this is now fixed.
- The previous version introduced a bug that broke the extended syntax when searching for keywords in a particular order. This is now fixed.

Version 0.98 beta released

Henri Ruutinen 14.03.2017 20:00

Changelog:
- A small bug was found in the phrase proximity algorithm. While making corrections to the code I was also able to eliminate one busy calculation loop altogether.
- Results from combined main+delta indexes had incorrect BM25 scores; this is now fixed.
- Search indexes with sentiment analysis gave incorrect results in certain situations (in sentiment search mode). Bugs were found in the files token_compressor_ext.php and token_compressor_merger_ext.php which write temporary data in .txt files.
- The SetFilterRange() method was pretty much broken. Not anymore ( see the usage sketch after this list ).
- Memory consumption is now lower when group-sorting results by document scores.
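
A short usage sketch for the now-working SetFilterRange() method; the ( attribute, min, max ) signature is an assumption modeled on Sphinx-style APIs, so check the API documentation:

<?php
# usage sketch for SetFilterRange(); the ( attribute, min, max ) signature
# is an assumption, see the API documentation
$pickmybrain->SetFilterRange("price", 10, 100); // keep results where 10 <= price <= 100
$results = $pickmybrain->Search("mykeyword");
?>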

Version 0.97 beta released

Henri Ruutinen 09.03.2017 23:25

Changelog:
- Made multiple small improvements to PMBApi's document keyword position decoding sequence. Combined, these changes yield a ~10% improvement in searching performance.
- Also fixed a small bug which affected document scoring when an unwanted ( excluded ) keyword was given.

Version 0.96 beta released

Henri Ruutinen 03.03.2017 21:09

Changelog:
- Major changes made to PMBApi's data decompression algorithm. Previously, the binary strings containing keyword data were expanded directly into PHP arrays. Since PHP is not very memory-efficient when it comes to arrays, this drove memory consumption unnecessarily high. The new algorithm uses a different approach that decodes only a small part of each binary string at a time ( a sketch of the idea follows this list ). Memory consumption is significantly lower, so this update is highly recommended.
- Made optimizations in many other places; overall performance is now better even though the new algorithm is somewhat more complex.
- Future versions will be optimized further with new clever heuristics.
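
To illustrate the idea ( this is a sketch of the approach, not the actual PMBApi code ), values can be pulled out of a compressed binary string one at a time, using the variable byte scheme described in the "How everything started" post below; $doc_ids_blob stands for one such compressed string:

<?php
# decode one value per call instead of expanding the whole string at once
function decode_next(string $bin, int &$offset)
{
    $value = 0;
    $len = strlen($bin);
    while ($offset < $len) {
        $byte = ord($bin[$offset++]);
        $value = ($value << 7) | ($byte & 0x7F);
        if ($byte & 0x80) {
            return $value;          // high bit set: this value is complete
        }
    }
    return false;                   // end of the binary string
}

$offset = 0;
$doc_id = 0;
while (($delta = decode_next($doc_ids_blob, $offset)) !== false) {
    $doc_id += $delta;              // undo the delta encoding
    // process one document id at a time, keeping memory usage flat
}
?>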

Problem with downloading should be fixed now

Henri Ruutinen 30.08.2016 12:25

Version 0.95 beta released

Henri Ruutinen 25.08.2016 12:25

Changelog:
- Added a new method for index growing. Merging data into big existing indexes was slow, because the old index had to be read and rewritten row by row. The new alternative method places the new data into a parallel index instead, which speeds up indexing. This method is user-configurable and disabled by default.
- Added checks for whether any data was actually tokenized during the first phase of indexing. If multiprocessing was enabled and some of the processes did not tokenize anything, the indexer may have halted.
- Multiple small bug fixes.

Version 0.93 beta released

Henri Ruutinen 24.07.2016 01:07

Changelog:
- Added a method for uninstalling Pickmybrain. This option deletes all the database tables, temporary files and configuration files. The pickmybrain folder still has to be deleted manually. This option is available in the web control panel and in clisetup.php.
- Added a method for resetting indexing states. If an indexing process has crashed, this option resets all process indicators, which makes it possible to run the indexer again. Ongoing indexing processes will not be stopped.

Version 0.92 beta released

Henri Ruutinen 22.07.2016 17:50

Changelog:
- clisetup.php created the PMBIndexes database table with the InnoDB engine, which can cause deadlocks during indexing, especially if multiprocessing is enabled. The PMBIndexes table is now created with the MyISAM engine. It is actually quite important that PMBIndexes is a MyISAM table; I'll provide a solution for this in an upcoming release.
- If multiprocessing is used with database indexes, db_tokenizer.php now provides each sister process with a parameter telling it from which offset to start indexing documents. Previously, each process tried to resolve the offset by itself, sometimes resulting in inconsistent data.
- Finally, indexer.php got a new parameter, replace, which keeps existing data intact until the new indexing run is 100% complete.

Version 0.91 beta released

Henri Ruutinen 20.07.2016 18:32

Changelog:
- Fixed a bug in data_partitioner.php. This practically stopped indexing from working if the exec() script execution method ( and Linux sort ) was chosen and multiprocessing was disabled.
- Linux sort now works with web-crawler indexes too. It is enabled automatically if the exec() script execution method is chosen and the hosting environment supports it.
- Finally fixed the indexing state indicator bug in the web control panel. Now the web control panel displays more relevant and correct information when indexing.

Version 0.90 beta released

Henri Ruutinen 18.07.2016 00:32

As the sudden jump in the version number may indicate, this is a rather big update. Pickmybrain 0.90 BETA utilizes Linux sort ( if it is available ) for sorting temporary match data. Unlike before, the temporary data is not written into the database but into a text file instead. Not only is this faster by itself, it also uses less space than the database table + covering index design. And when it comes to sorting the actual data, Linux sort blows MySQL away: for big indexes, the sorting time advantage can be tenfold or even more. This is all good news, but of course your web environment has to support the exec() function and have the Linux sort program preinstalled. For the unlucky ones ( and Windows users ) the older MySQL sort method is still supported and will remain supported in future versions.

As a result, I had to change the compressed inverted index architecture a bit. The bad news is that you have to purge your old search indexes, but everything else is good. The new design uses less disk space and actually contains more information about the token positions in the indexed documents. This makes it possible to create new, more sophisticated ranking algorithms in the future if such needs arise. The program also has to do less work during the actual indexing phase: token matches are no longer stored as token pairs but as field positions instead, which greatly reduces the need for random I/O during indexing.

This version also has numerous bug fixes. The token_compressor_merger file had a design flaw which led to primary key collisions under certain circumstances. Also, the prefix_composer lacked a multibyte string definition and had an incorrect start offset for growing indexes, which led to an inconsistent number of created prefixes when multiprocessing was enabled.

Other changes:
- the indexer.php file has a new parameter: purge ( this removes all previously indexed data from the current index )
- PMB_MATCH_STRICT matching mode now requires that the first provided keyword is the first token of some field for the document to be considered a match ( before it was: last provided keyword = last token of some field ).

In future versions I will be concentrating more on the performance of the application programming interface.

Version 0.84 beta released

Henri Ruutinen 21.06.2016 03:32

Changelog:
- The problems with memory consumption have now been ( mostly ) fixed. In earlier versions, tokens were kept in memory during indexing, but now they are inserted into a temporary table with a unique index. Match data and tokens are matched later with the help of a 48-bit checksum ( or 32-bit and 16-bit checksums, to be exact ). Performance did not suffer either; initial tests show promising results.
- Performance upgrade: replaced prepared statements with PDO's quote method. This method turned out to be twice as fast, and there should not be any security implications either.
- The prefix data merging method is now better. Previously, the temporary prefix data was kept, and all prefixes were compressed again when the indexer was run. Now the new data is merged directly with the old data, which greatly reduces the disk impact.
- Temporary data for token matches and prefixes is now removed after indexing is done.
- Added index id into the web-based control panel. Earlier you just had to figure it out :)
- Removed some old, redundant code.

Version 0.83 beta released

Henri Ruutinen 07.06.2016 18:54

Changelog:
- Pickmybrain can now be used from the command line; see the files clisetup.php and clisearch.php
- As a result, configuration files are now text-based and no longer PHP files
- Fixed incorrect index definition on the PMBDocinfo table for web-crawler indexes
- The indexed documents counter should now be more accurate on database indexes
- Fixed a bug that prevented dialect processing from working properly when the user made changes to the charset
- When the indexer is launched, the index state is now updated to the database more quickly

Version 0.82 beta released

Henri Ruutinen 28.05.2016 21:55

Changelog:
- MyISAM temporary tables now use compressed indexes ( ~20-25% less space usage + a very, very small performance gain )
- Fixed some issues with the external read-only database configuration
- Made a small change to the prefix compressor; there is now less incoming traffic from the database server
- Fixed a bug that caused the manually defined attributes textarea to have "2" as its default value on new indexes

Version 0.81 beta released

Henri Ruutinen 26.05.2016 20:56

Changelog:
- Fixed a delta decoding bug which essentially stopped prefixes from working

Initial release (0.80 Beta)

Henri Ruutinen 22.05.2016 15:47

I decided to release an initial version of Pickmybrain. It is in no way a finalized product, but it should give some clue about what is to be expected. Pickmybrain is licensed under GPLv3, which means you are free to download, modify and redistribute it as you wish ( as long as you remember to honour the other license conditions :) ).

The beta version still has some quirks. Here is a list of currently known shortcomings:
1. There is no working memory management system in place yet.
2. The temporary table system for token hits and prefixes is flawed in some ways.
3. Memory consumption of PMBApi could still be optimized further.

Memory Management

The whole keyword dictionary is kept in memory during indexing, which speeds up the operation but requires quite a lot of memory. If multiprocessing is enabled, each process keeps its own dictionary and the memory consumption adds up. Another pitfall is the process ( token_compressor.php ) in which temporarily stored token matches are fetched from the database, compressed and inserted back into the database table PMBTokens as compressed binary strings. The same applies to the prefix compression process ( prefix_compressor.php ) and the table PMBPrefixes. The last two should actually be somewhat easy to fix with simple memory consumption monitoring and write buffer tuning. The first one requires some sort of compromise between memory usage and performance.

Temporary tables

When keywords are tokenized, they are inserted into a temporary MyISAM table named PMBdatatemp ( checksum (int), token_id (mediumint), document_id (int), count (tinyint), token_id_2 (mediumint), field_id (tinyint) ). The checksum column is a crc32 checksum of the actual token, and it is stored because the data needs to be read in ascending order to ensure decent insert performance for the final InnoDB table PMBTokens, which has the columns ( checksum (int), token (varbinary(40)), doc_matches (int), doc_ids (mediumblob) ) and a composite primary key of ( checksum, token ). MyISAM tables have great insert performance, especially when the indexes are disabled right after the table creation and enabled only after the table has been fully populated with data; a condensed sketch of this pattern follows the list below. Initially the data is in random order, but after the covering index is enabled with the proper ALTER TABLE command, a sorted index is created, and it satisfies the whole select query, providing good performance. However, this method has three downsides:
1. the inserted temporary data is not compressed, thus it takes a lot of space
2. the resulting index takes a lot of space
3. creating the index takes quite a long time, especially for tables that have billions of rows
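
To make the pattern concrete, here is a condensed sketch ( simplified, not the actual indexer code; the $pdo connection and the $token_matches array are assumed here ):

<?php
# condensed sketch of the bulk-load pattern described above ( simplified );
# $pdo is an open PDO connection, $token_matches stands for tokenizer output

$pdo->exec("ALTER TABLE PMBdatatemp DISABLE KEYS"); // fast, index-free inserts

$ins = $pdo->prepare("INSERT INTO PMBdatatemp
                      (checksum, token_id, document_id, `count`, token_id_2, field_id)
                      VALUES (?, ?, ?, ?, ?, ?)");
foreach ($token_matches as $match) {
    $ins->execute($match);          // one row of temporary match data
}

$pdo->exec("ALTER TABLE PMBdatatemp ENABLE KEYS");  // one sorted index build

# the covering index satisfies the whole query, so the read is sequential
$stmt = $pdo->query("SELECT checksum, token_id, document_id, `count`, token_id_2, field_id
                     FROM PMBdatatemp ORDER BY checksum");
?>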

One solution could be to write the temporary data in compressed form, since it needs to be compressed at some point. This would use less space, and quite possibly the writes would be faster too. The table could, for example, have three columns: checksum, token_id and doc_ids. However, this would create a new problem: how to read the table in the correct order. Creating an index on the checksum column would be trivial, but the real problem is that the index would be non-covering. This would be fine if we just wanted to fetch the checksums in the correct order, but now every entry in the index points to a data row at a random position on the disk, and that is a real dealbreaker for conventional hard disk drives. This is pretty much where I got stuck. The current method works, but I know in my gut that there is a better method which I haven't discovered yet. If you're reading this and have some wild ideas, don't hesitate to mail me tips or try to solve it yourself ;)

PMBApi and memory consumption

The PMBApi needs to store matching document ids in arrays. If the user provides multiple keywords when querying a large search index, it is possible that the resulting array(s) have a lot of items. And as you probably know, PHP arrays take a lot of space. However, I don't feel this is a real dealbreaker, and I will personally pay some attention to it in future releases.

Summary

That is about it. I feel I am probably halfway to creating a suitable multi-purpose search engine purely with PHP and MySQL. As this is the initial release, I would love to have some feedback on the work so far. If you have something to share, don't be shy :) Cheers!

P.S. Remember to upgrade to PHP 7 ( if you haven't already )

How everything started

Henri Ruutinen 21.05.2016 21:20

It has been almost one year since I started this project. The original goal was to create a simple web crawler and a background system for indexing websites without too much hassle. Well, my ambitions grew along the way. Instead of a simple web crawler, it actually started to make sense - for me at least - to expand the abilities of the background system, or search engine as some might say, to support indexing of various databases. Some people certainly want a more controlled way of indexing data, and in many situations it is simpler to read the data directly from a database than to first output it as a group of web pages. Simplicity in terms of usability was one of my main goals for the project, and I personally feel those searching for a very simple and easy-to-implement solution won't be disappointed either.

I chose MySQL as the data storage solution for my project. It is widely available on shared web hosting services and supported on many other platforms as well. Databases such as MySQL also have built-in methods for data caching, which is really great, since I didn't feel like reinventing the wheel. But from the beginning I had no intention of using the existing full-text search feature. It has proved notoriously inconsistent across different MySQL versions, and it is not very performant either. The tricky part, of course, was to create something better within the data structures and limitations of MySQL. I started with a traditional relational approach, and for a long time I really struggled to create something that would be versatile ( feature-wise ) and performant at the same time. In the end it just seemed impossible: after many hours of optimization the queries would run reasonably fast when already cached, but the real problem was the caching itself. Not that it would not work, but an inverted index like this, with every token hit recorded as its own row, simply took too much space and filled the precious buffer pool way too early. It was time to think different.

I was researching how to compress integers efficiently and ran into an article about variable byte codes. Normally, when an integer is stored, its nominal size is exactly how many bits will be used for storing the actual value. If I wanted to store the number two ( 10 in binary ) as a 32-bit integer, it would be stored with 30 leading zeros. Variable byte encoding aims to reduce the number of space-consuming leading zeros by dividing the integer into smaller parts ( into bytes, for example ). The first bit of each part ( or byte ) is reserved for indicating whether the number ends at that part. In the case of the number two, we would end up with the bit sequence 10000010. That is already a 4:1 compression ratio. If we want to encode an array of integers, we can apply another trick as well: if the values can be sorted in ascending order, only the first value needs to be stored as a complete number - the following values can be stored as delta values. For example, an array with the values (1, 100, 150, 230) could be delta encoded into (1, 99, 50, 80). Combined with variable byte encoding, delta encoding provides additional space savings.
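
To make this concrete, here is a minimal sketch of the scheme just described ( an illustration, not Pickmybrain's actual implementation ):

<?php
# minimal sketch: delta encoding followed by variable byte encoding,
# with the high bit marking the final byte of each encoded value

function vbyte_encode(int $n): string
{
    $bytes = array($n & 0x7F);             // lowest 7-bit group
    $n >>= 7;
    while ($n > 0) {
        array_unshift($bytes, $n & 0x7F);  // prepend higher 7-bit groups
        $n >>= 7;
    }
    $bytes[count($bytes) - 1] |= 0x80;     // high bit marks the last byte
    return implode("", array_map("chr", $bytes));
}

function delta_vbyte_encode(array $sorted_values): string
{
    $out = "";
    $prev = 0;
    foreach ($sorted_values as $value) {
        $out .= vbyte_encode($value - $prev); // store the gap, not the value
        $prev = $value;
    }
    return $out;
}

# (1, 100, 150, 230) becomes deltas (1, 99, 50, 80), one byte each here;
# encoding the value 2 yields the single byte 10000010, as described above
$compressed = delta_vbyte_encode(array(1, 100, 150, 230));
?>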

This would of course mean moving from the common one-column-one-value approach to variable-length binary columns. The relational database model would also need to be completely discarded. But hey, anything that works! Many parts of the program code had to be rewritten, and the code base became inherently more complex, since compressing, decompressing, writing and reading binary data all need customized functions. At the same time I had a new old idea: for some years I had been developing a library for sentiment analysis, and in fact it was already in use in another project. I had pretty much finished the libraries for the English and Finnish languages, but the actual analyzer still lacked something. The weakness of my original approach started to rear its head when the analyzed texts got long. Normally the writer expresses his or her feelings on many different subjects, and analyzing the text as a whole does not give enough information about which topics the writer likes or dislikes. But then I realized that combining a search engine and a sentiment analyzer actually makes a lot of sense: a search engine provides a natural way to tokenize the document, and storing the score context of individual tokens is trivial when you've already got the right architecture. So that is what Pickmybrain turned out to be - a combined search engine and sentiment analyzer.